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The application of Bayesian methods in cosmology and astrophysics has flourished over the past decade, spurred 
by data sets of increasing size and complexity. In many respects, Bayesian methods have proven to be vastly superior 
to more traditional statistical tools, offering the advantage of higher efficiency and of a consistent conceptual basis 
for dealing with the problem of induction in the presence of uncertainty. This trend is likely to continue in the future, 
when the way we collect, manipulate and analyse observations and compare them with theoretical models will assume 
an even more central role in cosmology. 

This review is an introduction to Bayesian methods in cosmology and astrophysics and recent results in the field. 
I first present Bayesian probability theory and its conceptual underpinnings, Bayes' Theorem and the role of priors, 
i-i^ i I discuss the problem of parameter inference and its general solution, along with numerical techniques such as Monte 

Oh Carlo Markov Chain methods. I then review the theory and application of Bayesian model comparison, discussing 

the notions of Bayesian evidence and effective model complexity, and how to compute and interpret those quantities, 
w ' Recent developments in cosmological parameter extraction and Bayesian cosmological model building are summarized, 

highlighting the challenges that lie ahead. 
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At first glance, it might appear surprising that a trivial mathematical result obtained by an obscure 
minister over 200 hundred years ago ought still to excite so much interest across so many disciplines, 
OO ! from econometrics to biostatistics, from financial risk analysis to cosmology. Published posthumously 
■ thanks to Richard Price in 1763, "An essay towards solving a problem in the doctrine of chances" by the 
, rev. Thomas Bayes (1701(?)-1761) [1] had nothing in it that could herald the growing importance and 
enormous domain of application that the subject of Bayesian probability theory would acquire more than 
two centuries afterwards. However, upon reflection there is a very good reason why Bayesian methods 
& \ are undoubtedly on the rise in this particular historical epoch: the exponential increase in computational 
power of the last few decades made massive numerical inference feasible for the first time, thus opening 
the door to the exploitation of the power and flexibility of a rich set of Bayesian tools. Thanks to fast and 
cheap computing machines, previously unsolvable inference problems became tractable, and algorithms for 
numerical simulation flourished almost overnight. 

Historically, the connections between physics and Bayesian statistics have always been very strong. 
Many ideas were developed because of related physical problems, and physicists made several distinguished 
contributions. One has only to think of people like Laplace, Bernouilli, Gauss, Metropolis, Jeffreys, etc. 
Cosmology is perhaps among the latest disciplines to have embraced Bayesian methods, a development 
mainly driven by the data explosion of the last decade, as Figure [T] indicates. However, motivated by 
difficult and computationally intensive inference problems, cosmologists are increasingly coming up with 
new solutions that add to the richness of a growing Bayesian literature. 
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Some cosmologists are sceptic regarding the usefulness of employing more advanced statistical methods, 
perhaps because they think with Mark Twain that there are "lies, damned lies and statistics". One 
argument that is often heard is that there is no point in bothering too much about refined statistical 
analyses, as better data will in the future resolve the question one way or another, be it the nature of 
dark energy or the initial conditions of the Universe. I strongly disagree with this view, and would instead 
argue that sophisticated statistical tools will be increasingly central for modern cosmology. This opinion 
is motivated by the following reasons: 

(i) The complexity of the modelling of both our theories and observations will always increase, thus 
requiring correspondingly more refined statistical and data analysis skills. In fact, the scientific return 
of the next generation of surveys will be limited by the level of sophistication and efficiency of our 
inference tools. 

(ii) The discovery zone for new physics is when a potentially new effect is seen at the 3-4 a level. This is 
when tantalizing suggestion for an effect starts to accumulate but there is no firm evidence yet. In this 
potential discovery region a careful application of statistics can make the difference between claiming 
or missing a new discovery. 

(iii) If you are a theoretician, you do not want to waste your time trying to explain an effect that is not 
there in the first place. A better appreciation of the interpretation of statistical statements might help 
in identifying robust claims from spurious ones. 

(iv) Limited resources mean that we need to focus our efforts on the most promising avenues. Experiment 
forecast and optimization will increasingly become prominent as we need to use all of our current 
knowledge {and the associated uncertainty) to identify the observations and strategies that are likely 
to give the highest scientific return in a given field. 

(v) Sometimes there will be no better data! This is the case for the many problems associated with cosmic 
variance limited measurements on large scales, for example in the cosmic background radiation, where 
the small number of independent directions on the sky makes it impossible to reduce the error below 
a certain level. 

This review focuses on Bayesian methodologies and related issues, presenting some illustrative results 
where appropriate and reviewing the current state-of-the art of Bayesian methods in cosmology. The 
emphasis is on the innovative character of Bayesian tools. The level is introductory, pitched for graduate 
students who are approaching the field for the first time, aiming at bridging the gap between basic textbook 
examples and application to current research. In the last sections we present some more advanced material 
that we hope might be useful for the seasoned practitioner, too. A basic understanding of cosmology and 
of the interplay between theory and cosmological observations (at the level of the introductory chapters 
in [2]) is assumed. A full list of references is provided as a comprehensive guidance to relevant literature 
across disciplines. 

This paper is organized in two main parts. The first part, sections [2HU focuses on probability theory, 
methodological issues and Bayesian methods generally. In section [2] we present the fundamental distinction 
between probability as frequency or as degree of belief, we introduce Bayes' Theorem and discuss the 
meaning and role of priors in Bayesian theory. Section [3] is devoted to Bayesian parameter inference and 
related issues in parameter extraction. Section [4] deals with the topic of Bayesian model comparison from 
a conceptual and technical point of view, covering Occam's razor principle, its practical implementation in 
the form of the Bayesian evidence, the effective number of model parameters and information criteria for 
approximate model comparison. The second part presents applications to cosmological parameter inference 
and related topics (section [5j) and to Bayesian cosmological model building (section [6j) , including multi- 
model inference and model comparison forecasting. Section [7] gives our conclusions. 

2 Bayesian probability theory 

In this section we introduce the basic concepts and the notation we employ. After a discussion of what 
probability is, we turn to the central formula for Bayesian inference, namely Bayes theorem. The whole of 
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Figure 1. The evolution of the B-word: number of articles in astronomy and cosmology with "Bayesian" in the title, as a function of 
publication year. The number of papers employing one form or another of Bayesian methods is of course much larger than that. Up 
until about 1995, Bayesian papers were concerned mostly with image reconstruction techniques, while in subsequent years the domain 
of application grew to include signal processing, parameter extraction, object detection, cosmological model building, decision theory 
and experiment optimization, and much more. It appears that interest in Bayesian statistics began growing around 2002 (source: 

NASA/ADS). 

Bayesian inference follows from this extremely simple cornerstone. We then present some views about the 
meaning of priors and their role in Bayesian theory, an issue which has always been (wrongly) considered 
a weak point of Bayesian statistics. 

There are many excellent textbooks on Bayesian statistics: the works by Sir Harold Jeffreys [3] and 
Bruno de Finetti [4] are classics, while an excellent modern introduction with an extensive reading list 
is given by [5]. A good textbook is [6]. Worth reading as a source of inspiration is the though-provoking 
monograph by E.T. Jaynes [7]. Computational aspects are treated in [8], while MacKay [9] has a quite 
unconventional but inspiring choice of topics with many useful exercices. Two very good textbooks on 
the subject written by physicists are [10,11]. A nice introductory review aimed at physicists is [12] (see 
also [13]). Tom Loredo has some masterfully written introductory material, too [14,15]. A good source 
expanding on many of the topics covered here is Ref. [16]. 

2.1 What is probability? 

2.1.1 Probability as frequency. The classical approach to statistics defines the probability of an event 
as 

"the number of times the event occurs over the total number of trials, in the limit of an infinite series of 
equiprobable repetitions. " 

This is the so-called frequentist school of thought. This definition of probability is however unsatisfactory 
in many respects. 

(i) Strikingly, this definition of probability in terms of relative frequency of outcomes is circular, i.e. it 
assumes that repeated trials have the same probability of outcomes - but it was the the very notion 
of probability that we were trying to define in the first place! 

(ii) It cannot handle with unrepeatable situations, such as the probability that I will be overrun by a 
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car when crossing the street, or, in the cosmological context, questions concerning the properties of 
the observable Universe as a whole, of which we have exactly one sample. Indeed, perfectly legitimate 
questions such as "what is the probability that it was raining in Oxford when William I was crowned?" 
cannot even be formulated in classical statistics, 
(hi) The definition only holds exactly for an infinite sequence of repetitions. In practice we always handle 
with a finite number of measurements, sometimes with actually only a very small number of them. How 
can we assess when "how many repetitions" are sufficient? And what shall we do when we have only 
a handful of repetitions? Frequentist statistics does not say, except sometimes devising complicated 
ad-hockeries to correct for "small sample size" effects. In practice, physicists tend to forget about the 
"infinite series" requirement and use this definitions and the results that go with it (for example, about 
asymptotic distributions of test statistics) for whatever number of samples they happen to be working 
with. 

Another, more subtle aspects has to do with the notion of "randomness" . Restricting ourselves to classical 
(non-chaotic) physical systems for now, let us consider the paradigmatic example of a series of coin tosses. 
From an observed sequence of heads and tails we would like to come up with a statistical statement about 
the fairness of the coin, which is deemed to be "fair" if the probability of getting heads is pn = 0.5. 
At first sight, it might appear plausible that the task is to determine whether the coin possesses some 
physical property (for example, a tensor of intertia symmetric about the plane of the coin) that will 
ensure that the outcome is indifferent with respect to the interchange of heads and tails. As forcefully 
argued by Jaynes [7], however, the probability of the outcome of a sequence of tosses has nothing to 
do with the physical properties of the coin being tested! In fact, a skilled coin-tosser (or a purpose-built 
machine, see [17]) can influence the outcome quite independently of whether the coin is well-balanced (i.e., 
symmetric) or heavily loaded. The key to the outcome is in fact the definition of random toss. In a loose, 
intuitive fashion, we sense that a carefully controlled toss, say in which we are able to set quite precisely 
the spin and speed of the coin, will spoil the "randomness" of the experiment — in fact, we might well call 
it "cheating". However, lacking a precise operational definition of what a "random toss" means, we cannot 
meaningfully talk of the probability of getting heads as of a physical property of the coin itself. It appears 
that the outcome depends on our state of knowledge about the initial conditions of the system (angular 
momentum and velocity of the toss): an lack of precise information about the initial conditions results in 
a state of knowledge of indifference about the possible outcome with respect to the specification of heads 
or tails. If however we insist on defining probability in terms of the outcome of random experiments, we 
immediately get locked up in a circularity when we try to specify what "random" means. For example, 
one could say that 

"a random toss is one for which the sequence of heads and tails is compatible with assuming the hypothesis 
p H = 0.5". 

But the latter statement is exactly what we were trying to test in the first place by using a sequence of 
random tosses! We are back to the problem of circular definition we highlighted above. 

2.1.2 Probability as degree of belief. Many of the limitations above can be avoided and paradoxes 
resolved by taking a Bayesian stance about probabilities. The Bayesian viewpoint is based on the simple 
and intuitive tenet that 

"probability is a measure of the degree of belief about a proposition" . 

It is immediately clear that this definition of probability applies to any event, regardless whether we are 
considering repeated experiments (e.g., what is the probability of obtaining 10 heads in as many tosses of 
a coin?) or one-off situations (e.g., what is the probability that it will rain tomorrow?). Another advantage 
is that it deals with uncertainty independently of its origin, i.e. there is no distinction between "statistical 
uncertainty" coming from the finite precision of the measurement apparatus and the associated random 
noise and "systematic uncertainty", deriving from deterministic effects that are only partially known 
(e.g., calibration uncertainty of a detector). From the coin tossing example above we learn that it makes 
good sense to think of probability as a state of knowledge in presence of partial information and that 
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"randomness" is really a consequence of our lack of information about the exact conditions of the system 
(if we knew the precise way the coin is flipped we could predict the outcome of any toss with certainty. 
The case of quantum probabilities is discussed below). The rule for manipulating states of belief is given 
by Bayes' Theorem, which is introduced in Eq. ([5]) below. 

It seems to us that the above arguments strongly favour the Bayesian view of probability (a more 
detailed discussion can be found in [7,14]). Ultimately, as physicists we might as well take the pragmatic 
view that the approach that yields demonstrably superior results ought to be preferred. In many real-life 
cases, there are several good reasons to prefer a Bayesian viewpoint: 

(i) Classic frequentist methods are often based on asymptotic properties of estimators. Only a handful of 
cases exist that are simple enough to be amenable to analytic treatment (in physical problems one most 
often encounters the Normal and the Poisson distribution). Often, methods based on such distributions 
are employed not because they accurately describe the problem at hand, but because of the lack of 
better tools. This can lead to serious mistakes. Bayesian inference is not concerned by such problems: 
it can be shown that application of Bayes' Theorem recovers frequentist results (in the long run) for 
cases simple enough where such results exist, while remaining applicable to questions that cannot even 
be asked in a frequentist context. 

(ii) Bayesian inference deals effortlessly with nuisance parameters. Those are parameters that have an in- 
fluence on the data but are of no interest for us. For example, a problem commonly encountered in 
astrophysics is the estimation of a signal in the presence of a background rate (see [14,18,19]). The 
particles of interest might be photons, neutrinos or cosmic rays. Measurements of the source s must 
account for uncertainty in the background, described by a nuisance parameter b. The Bayesian proce- 
dure is straightforward: infer the joint probability of s and b and then integrate over the uninteresting 
nuisance parameter b ( "marginalization" , see Eq. (|16p ). Frequentist methods offer no simple way of 
dealing with nuisance parameters (the very name derives from the difficulty of accounting for them in 
classical statistics). However neglecting nuisance parameters or fixing them to their best-fit value can 
result in a very serious underestimation of the uncertainty on the parameters of interest (see [20] for 
an example involving galaxy evolution models). 

(iii) In many situations prior information is highly relevant and omitting it would result in seriously wrong 
inferences. The simplest case is when the parameters of interest have a physical meaning that restricts 
their possible values: masses, count rates, power and light intensity are examples of quantities that 
must be positive. Frequentist procedures based only on the likelihood can give best-fit estimates that 
are negative, and hence meaningless, unless special care is taken (for example, constrained likelihood 
methods). This often happens in the regime of small counts or low signal to noise. The use of Bayes' 
Theorem ensures that relevant prior information is accounted for in the final inference and that phys- 
ically meaningless results are weeded out from the beginning. 

(iv) Bayesian statistics only deals with the data that were actually observed, while frequentist methods focus 
on the distribution of possible data that have not been obtained. As a consequence, frequentist results 
can depend on what the experimenter thinks about the probability of data that have not been observed. 
(this is called the "stopping rule" problem). This state of affairs is obviously absurd. Our inferences 
should not depend on the probability of what could have happened but should be conditional on 
whatever has actually occurred. This is built into Bayesian methods from the beginning since inferences 
are by construction conditional on the observed data. 

However one looks at the question, it is fair to say that the debate among statisticians is far from settled 
(for a discussion geared for physicists, see [21]). Louis Lyons neatly summarized the state of affairs by 
saying that [22] 

"Bayesians address the question everyone is interested in by using assumptions no-one believes, while frequen- 
tists use impeccable logic to deal with an issue of no interest to anyone". 
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2.1.3 Quantum probability. Ultimately the quantum nature of the microscopic world ensures that the 
fundamental meaning of probability is to be found in a theory of quantum measurement. The fundamental 
problem is how to make sense of the indeterminism brought about when the wavefunction collapses into 
one or other of the eigenstates being measured. The classic textbook view is that the probability of each 
outcome is given by the square of the amplitude, the so-called "Born rule" . However the question remains 
— where do quantum probabilities come from? 

Recent developments have scrutinized the scenario of the Everettian many-worlds interpretation of 
quantum mechanics, in which the collapse never happens but rather the world "splits" in disconnected 
"branches" every time a quantum measurement is performed. Although all outcomes actually occur, uncer- 
tainty comes into the picture because an observer is unsure about which outcome will occur in her branch. 
David Deutsch [23] suggested to consider the problem from a decision-theoretic point of view. He proposed 
to consider a formal system in which the probabilities ( "weights" ) to be assigned to each quantum branch 
are expressed in terms of the preferences of a rational agent playing a quantum game. Each outcome of 
the game (i.e., weights assignment) has an associated utility function that determines the payoff of the 
rational agent. Being rational, the agent will act to maximise the expectation value of her utility. The 
claim is that it is possible to find a simple and plausible set or rules for the quantum game that allow 
to derive unique outcomes (i.e., probability assignments) in agreement with the Born rule, independently 
of the utility function chosen, i.e. the payoffs. In such a way, one obtains a bridge between subjective 
probabilities (the weight assignment by the rational agent) and quantum chance (the weights attached to 
the branches according to the Born rule). 

This approach to the problem of quantum measurement remains highly controversial. For an introduction 
and further reading, see e.g. [24, 25] and references therein. An interesting comparison between classical 
and quantum probabilities can be found in [26]. 



2.2 Bayes' Theorem 

Bayes' Theorem is a simple consequence of the axioms of probability theory, giving the rules by which 
probabilities (understood as degree of belief in propositions) should be manipulated. As a mathematical 
statement it is not controversial — what is a matter of debate is whether it should be used as a basis for 
inference and in general for dealing with uncertainty. In the previous section we have given some arguments 
why we strongly believe this to be the case. An important result is that Bayes' Theorem can be derived 
from a set of basic consistency requirements for plausible reasoning, known as Cox axioms [27]. Therefore, 
Bayesian probability theory can be shown to be the unique generalization of boolean logic into a formal 
system to manipulate propositions in the presence of uncertainty [7]. In other words, Bayesian inference 
is the unique generalization of logical deduction when the available information is incomplete. 

We now turn to the presentation of the actual mathematical framework. We adopt a fairly relaxed 
notation, as mathematical rigour is not the aim of this review. For a more formal introduction, see e.g. [3,7]. 

Let us consider a proposition A, which could be a random variable (e.g., the probability of obtaining 12 
when rolling two dices) or a one-off proposition (the probability that Prince Charles will become king in 
2015), and its negation, A. The sum rule reads 



where the vertical bar means that the probability assignment is conditional on assuming whatever infor- 
mation is given on its right. Above, I represents any relevant information that is assumed to be true. The 
product rule is written as 



which says that the joint probability of A and B equals to the probability of A given that B occurs times 
the probability of B occurring on its own (both conditional on information /). If we are interested in the 



p(A\I)+p(A\I) = l 



(1) 



p(A,B\I) =p(A\B,I)p(B\I) 



(2) 
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probability of B alone, irrespective of A, the sum and product rules together imply that 



p(b\i) = J2p(AB\ I), (3) 

A 

where the sum runs over the possible outcomes for proposition A. The quantity on the left-hand-side is 
called marginal probability of B. Since obviously p(A, B\I) = p(B, A\I), the product rule can be rewritten 
to give Bayes' Theorem: 

p (b\a,i) = p ^ A ^2\rj B ^ (Bayes theorem )- ( 4 ) 

The interpretation of this simple result is more illuminating if one replaces for A the observed data al and 
for B the hypothesis H we want to assess, obtaining 

p(d\H,I)p(H\I) 

P{H\d,I) = ^ • (5) 

On the left-hand side, p(H\d, I) is the posterior probability of the hypothesis taking the data into account. 
This is proportional to the sampling distribution of the data p(d\H, I) assuming the hypothesis is true, times 
the prior probability for the hypothesis, p(H\I) ("the prior", conditional on whatever external information 
we have, /), which represents our state of knowledge before seeing the data. The sampling distribution 
encodes how the degree of plausibility of the hypothesis changes when we acquire new data. Considered 
as a function of the hypothesis, for fixed data (the ones that have been observed), it is called the likelihood 
function and we will often employ the shortcut notation C(H) = p(d\H, I). Notice that as a function of the 
hypothesis the likelihood is not a probability distribution. The normalization constant on the right-hand- 
side in the denominator is the marginal likelihood (in cosmology often called the "Bayesian evidence") 
given by 



p{d\I) = ' s ^p{d\H, I)p(H\I) (Bayesian evidence). (6) 

H 

where the sum runs over all the possible outcomes for the hypothesis H . This is the central quantity for 
model comparison purposes, and it is further discussed in section [H The posterior is the relevant quantity 
for Bayesian inference as it represents our state of belief about the hypothesis after we have considered the 
information in the data (hence the name). Notice that there is a logical sequence in going from the prior 
to the posterior, not necessarily a temporal one, i.e. a scientist might well specify the prior after the data 
have been gathered provided that the prior reflects her state of knowledge irrespective of the data. Bayes' 
Theorem is therefore a prescription as to how one learns from experience. It gives a unique rule to update 
one's beliefs in the light of the observed data. 

The need to specify a prior describing a "subjective" state of knowledge has exposed Bayesian inference 
to the criticism that it is not objective, and hence unfit for scientific reasoning. Exactly the contrary is 
true — a thorny issue to which we now briefly turn out attention. 



2.3 Subjectivity, priors and all that 

The prior choice is a fundamental ingredient of Bayesian statistics. Historically, it has been regarded as 
problematic, since the theory does not give guidance about how the prior should be selected. Here we 
argue that this issue has been given undue emphasis and that prior specification should be regarded as a 
feature of Bayesian statistics, rather than a limitation. 

The guiding principle of Bayesian probability theory is that there can be no inference without assump- 
tions, and thus the prior choice ought to reflect as accurately as possible one's assumptions and state of 
knowledge about the problem in question before the data come along. Far from undermining objectivity, 
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this is obviously a positive feature, because Bayes' Theorem gives a univoque procedure to update different 
degrees of beliefs that different scientist might have held before seeing the data. Furthermore, there are 
many cases where prior (i.e., external) information is relevant and it is sensible to include it in the inference 
procedure^. 

It is only natural that two scientists might have different priors as a consequence of their past scientific 
experiences, theoretical outlook and based on the outcome of previous observations they might have 
performed. As long as the prior p(H\I) (where the extra conditioning on / denotes the external information 
of the kind listed above) has a support that is non-zero in regions where the likelihood is large, repeated 
application of Bayes theorem, Eq. ([5]), will lead to a posterior pdf that converges to a common, i.e. objective 
inference on the hypothesis. As an example, consider the case where the inference concerns the value of 
some physical quantity 9, in which case p(9\d,I) has to be interpreted as the posterior probability density 
and p(9\d, I)d9 is the probability of 9 to take on a value between 9 and 9 + d9. Alice and Bob have different 
prior beliefs regarding the possible value of 9, perhaps based on previous, independent measurements of 
that quantity that they have performed. Let us denote their priors by p(9\Ii) (i = A, B) and let us assume 
that they are described by two Gaussian distributions of mean /ij and variance T, 2 , i = A,B representing 
the state of knowledge of Alice and Bob, respectively. Alice and Bob go together in the lab and perform a 
measurement of 9 with an apparatus subject to Gaussian noise of known variance a 2 . They obtain a value 
mi, hence their likelihood is 



CA'))= l>{llll 0) - r.,,exp(-^ (g J l)2 ). (7) 



Replacing the hypothesis H by the continuous variable 9 in Bayes' Theorem^, we obtain for their respective 
posterior pdf's after the new datum 

p{9\m l J l ) = C{e ^\ I ; ) (i = A,B). (8) 
p{mi\Ii) 

It is easy to see that the posterior pdf's of Alice and Bob are again Gaussians with means 

_ _ mi + (a/Y^ifm 

l + ws,) 2 [) 

and variance 

T -=IT(W = (10) 

Thus if the likelihood is more informative than the prior, i.e. for (<r/Sj) <C 1 the posterior means of Bob 
and Alice will converge towards the measured value, mi. As more and more data points are gathered, 
one can simply replace mi in the above equations by the mean m of the observations and a 2 by cr 2 /N, 
with N the number of data points. Thus we can see that the initial prior means m of Alice and Bob will 
progressively be overridden by the data. This process is illustrated in Figure [2j 

Finally, objectivity is ensured by the fact that two scientists in the same state of knowledge should 
assign the same prior, hence their posterior are identical if they observe the same data. The fact that the 
prior assignment eventually becomes irrelevant as better and better data make the posterior likelihood- 
dominated in uncontroversial in principle but may be problematic in practice. Often the data are not 
strong enough to override the prior, in which case great care must be given in assessing how much of 



*As argued above, often it would be a mistake not to do so, for example when trying to estimate a mass m from some data one should 
enforce it to be a positive quantity by requiring that p(m) = for m < 0. 

1 Strictly speaking, Bayes' Theorem holds for discrete probabilities and the passage to hypotheses represented by continuous variables 
ought to be performed with some mathematical care. Here we simply appeal to the intuition of physicists without being too much 
concerned by mathematical rigour. 
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Figure 2. Converging views in Bayesian inference. Two scientists having different prior believes p(6\I{) about the value of a quantity 8 

(panel (a), red and green pdf's) observe one datum with likelihood C(6) (panel (b)), after which their posteriors p(9\m\) (panel (c), 
obtained via Bayes Theorem, Eq. (|8jl) represent their updated states of knowledge on the parameter. After observing 100 data points, 

the two posteriors have become essentially indistinguishable (d). 



the final inference depends on the prior choice. This might occur for small sample sizes, or for problems 
where the dimensionality of the hypothesis space is larger than the number of observations (for example, 
in image reconstruction). Even in such a case, if different prior choices lead to different posteriors we can 
still conclude that the data are not informative enough to completely override our prior state of knowledge 
and hence we have learned something useful about the constraining power (or lack thereof) of the data. 

The situation is somewhat different for model comparison questions. In this case, it is precisely the 
available prior volume that is important in determining the penalty that more complex models with more 
free parameters should incur into (this is discussed in detail in section UJ) . Hence the impact of the prior 
choice is much stronger when dealing with model selection issues, and care should be exercised in assessing 
how much the outcome would change for physically reasonable changes in the prior. 

There is a vast literature on priors that we cannot begin to summarize here. Important issues concern the 
determination of "ignorance priors", i.e. priors reflecting a state of indifference with respect to symmetries 
of the problem considered. "Reference priors" exploit the idea of using characteristics of what the exper- 
iment is expected to provide to construct priors representing the least informative state of knowledge. In 
order to be probability distributions, priors must be proper, i.e. normalizable to unity probability content. 
"Flat priors" are often a standard choice, in which the prior is taken to be constant within some minimum 
and maximum value of the parameters, i.e. for a 1-dimensional case p{9) = (# max — ^min) ■ The rationale 
is that we should assign equal probability to equal states of knowledge. However, flat priors are not always 
as harmless as they appear. One reason is that a flat prior on a parameter 9 does not correspond to a flat 
prior on a non-linear function of that parameter, ip(9). The two priors are related by 

(11) 

so for a non-linear dependence ■*/>(#) the term |d#/dV>| means that an uninformative (flat) prior on 9 might 
be strongly informative about tp (in the multi-dimensional case, the derivative term is replaced by the 
determinant of the Jacobian for the transformation). Furthermore, if we are ignorant about the scale of 
a quantity 9, it can be shown (see e.g. [7]) that the appropriate prior is flat on ln0, which gives equal 
weight to all orders of magnitude. This prior corresponds to p(9) oc 9~ 1 and is called "Jeffreys' prior". It 
is appropriate for example for the rate of a Poisson distribution. For further details on prior choice, and 
especially so-called "objective priors", see e.g. chapter 5 in [28] and references therein. 



pW=p(0) 



0.0 
~c\ih 



3 Bayesian parameter inference 

3.1 The general problem and its solution 

The general problem of Bayesian parameter inference can be specified as follows. We first choose a model 
containing a set of hypotheses in the form of a vector of parameters, 9. The parameters might describe 
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any aspect of the model, but usually they will represent some physically meaningful quantity, such as for 
example the mass of an extra-solar planet or the abundance of dark matter in the Universe. Together with 
the model we must specify the priors for the parameters. Priors should summarize our state of knowledge 
about the parameters before we consider the new data, and for the parameter inference step the prior for 
a new observation might be taken to be the posterior from a previous measurement (for model comparison 
issues the prior is better understood in a different way, see section The caveats about priors and prior 
specifications presented in the previous section will apply at this stage. 

The central step is to construct the likelihood function for the measurement, which usually reflects the 
way the data are obtained. For example, a measurement with Gaussian noise will be represented by a 
Normal distribution, while 7-ray counts on a detector will have a Poisson distribution for a likelihood. 
Nuisance parameters related to the measurement process might be present in the likelihood, e.g. the 
variance of the Gaussian might be unknown or the background rate in the absence of the source might be 
subject to uncertainty. This is no matter of concern for a Bayesian, as the general strategy is always to 
work out the joint posterior for all of the parameters in the problem and then marginalize over the ones 
we are not interested in. Assuming that we have a set of physically interesting parameters 4> and a set of 
nuisance parameters tp, the joint posterior for 9 = (</>, tp) is obtained through Bayes' Theorem: 



^• M)=c{e > P mm (12) 

where we have made explicit the choice of a model M. by writing it on the righ-hand-side of the condition- 
ing symbol. Recall that C{9) = p(d\9,M) denotes the likelihood and p(9\A4) the prior. The normalizing 
constant p(d\A4) ("the Bayesian evidence") is irrelevant for parameter inference (but central to model com- 
parison, see sectionH]), so we can write the marginal posterior on the parameter of interest as (marginalizing 
over the nuisance parameters) 



p{<f>\d,M) oc / £(0,VOp(&V|AW. (13) 

The final inference on from the posterior can then be communicated either by some summary statistics 
(such as the mean, the median or the mode of the distribution, its standard deviation and the correlation 
matrix among the components) or more usefully (especially for cases where the posterior presents mul- 
tiple peaks or heavy tails) by plotting one or two dimensional subsets of cj>, with the other components 
marginalized over. 

In real life there are only a few cases of interest for which the above procedure can be carried out 
analytically. Quite often, however, the simple case of a Gaussian prior and a Gaussian likelihood can offer 
useful guidance regarding the behaviour of more complex problems. An analytical model of a Poisson- 
distributed likelihood for estimating source counts in the presence of a background signal is worked out 
in [14]. In general, however, actual problems in cosmology and astrophysics are not analytically tractable 
and one must resort to numerical techniques to evaluate the likelihood and to draw samples from the 
posterior. Fortunately this is not a major hurdle thanks to the recent increase of cheap computational 
power. In particular, numerical inference often employs a technique called Markov Chain Monte Carlo, 
which allows to map out numerically the posterior distribution of Eq. (|12p even in the most complicated 
situations, where the likelihood can only be obtained by numerical simulation, the parameter space can 
have hundreds of dimensions and the posterior has multiple peaks and a complicated structure. 

For further reading about Bayesian parameter inference, see [10,11,29,30]. For more advanced applica- 
tions to problems in astrophysics and cosmology, see [16,31]. 



3.2 Markov Chain Monte Carlo techniques for parameter inference 

The general solution to any inference problem has been outlined in the section above: it remains to find a 
way to evaluate the posterior of Eq. (|12p for the usual case where analytical solutions do not exist or are 
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insufficiently accurate. Nowadays, Bayesian inference heavily relies on numerical simulation, in particular 
in the form of Markov Chain Monte Carlo (MCMC) techniques, which are discussed in this section. 

The purpose of the Markov chain Monte Carlo algorithm is to construct a sequence of points in parameter 
space (called "a chain"), whose density is proportional to the posterior pdf of Eq. (fT2j) . Developing a full 
theory of Markov chains is beyond the scope of the present article (see e.g. [32,33] instead). For our purposes 
it suffices to say that a Markov chain is defined as a sequence of random variables {X<® , JfW X( M ~V } 
such that the probability of the (t + l)-th element in the chain only depends on the value of the t— 
th element. The crucial property of Markov chains is that they can be shown to converge to a stationary 
state (i.e., which does not change with t) where successive elements of the chain are samples from the target 
distribution, in our case the posterior p{6\d). The generation of the elements of the chain is probabilistic in 
nature, and several algorithms are available to construct Markov chains. The choice of algorithm is highly 
dependent on the characteristics of the problem at hand, and "tayloring" the MCMC to the posterior 
one wants to explore often takes a lot of effort. Popular and effective algorithms include the Metropolis- 
Hastings algorithm [34,35], Gibbs sampling (see e.g. [36]), Hamiltonian Monte Carlo (see e.g. [37] and 
importance sampling. 

Once a Markov chain has been constructed, obtaining Monte Carlo estimates of expectations for any 
function of the parameters becomes a trivial task. For example, the posterior mean is given by (where (•} 
denotes the expectation value with respect to the posterior) 



(*)« P (0\d)0d9 = -J2 e{t) > ( 14 ) 

where the equality with the mean of the samples from the MCMC follows because the samples 0w are 
generated from the posterior by construction. In general, one can easily obtain the expectation value of 
any function of the parameters f(9) as 



M-l 

</w> ^mT, f( e{t) y ( i5 ) 

t=0 

It is usually interesting to summarize the results of the inference by giving the 1 -dimensional marginal 
probability for the j—th element of 9, 6j. Taking without loss of generality j = 1 and a parameter space of 
dimensionality n, the equivalent expression to Eq. ([3]) for the case of continuous variables is 



p (0i|d) = J P (0\d)de 2 ...de n , (16) 

where p{9\\d) is the marginal posterior for the parameter Q\. From the Markov chain it is trivial to obtain 
and plot the marginal posterior on the left -hand-side of Eq. f|16|) : since the elements of the Markov chains 
are samples from the full posterior, p(9\d), their density reflects the value of the full posterior pdf. It is 
then sufficient to divide the range of 9\ in a series of bins and count the number of samples falling within 
each bin, simply ignoring the coordinates values 62, ■ ■ ■ ,9 n - A 2-dimensional posterior is defined in an 
analogous fashion. 

There are several important practical issues in working with MCMC methods (for details see e.g. [32]). 
Especially for high-dimensional parameter spaces with multi-modal posteriors it is important not to use 
MCMC techniques as a black box, since poor exploration of the posterior can lead to serious mistakes in 
the final inference if it remains undetected. Considerable care is required to ensure as much as possible 
that the MCMC exploration has covered the relevant parameter space. 
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4 Bayesian model comparison 

4.1 Shaving theories with Occam's razor 

When there are several competing theoretical models, Bayesian model comparison provides a formal way 
of evaluating their relative probabilities in light of the data and any prior information available. The "best" 
model is then the one which strikes an optimum balance between quality of fit and predictivity. In fact, it 
is obvious that a model with more free parameters will always fit the data better (or at least as good as) 
a model with less parameters. However, more free parameters also mean a more "complex" model, in a 
sense that we will quantify below in section 14.61 Such an added complexity ought to be avoided whenever 
a simpler model provides an adequate description of the observations. This guiding principle of simplicity 
and economy of an explanation is known as Occam's razor — the simplest theory compatible with the 
available evidence ought to be preferred^- Bayesian model comparison offers a formal way to evaluate 
whether the extra complexity of a model is required by the data, thus putting on a firmer statistical 
grounds the evaluation and selection process of scientific theories that scientists often carry out at a more 
intuitive level. For example, a Bayesian model comparison of the Ptolemaic model of epicycles versus the 
heliocentric model based on Newtonian gravity would favour the latter because of its simplicity and ability 
to explain planetary motions in a more economic fashion than the baroque construction of epicycles. 

An important feature is that an alternative model must be specified against which the comparison 
is made. In contrast with frequentist goodness-of-fit tests (such as chi-square tests), Bayesian model 
comparison maintains that it is pointless to reject a theory unless an alternative explanation is available 
that fits the observed facts better (for more details about the difference in approach with frequentist 
hypothesis testing, see [14]). In other words, unless the observations are totally impossible within a model, 
finding that the data are improbable given a theory does not say anything about the probability of the 
theory itself unless we can compare it with an alternative. A consequence of this is that the probability 
of a theory that makes a correct prediction can increase if the prediction is confirmed by observations, 
provided competitor theories do not make the same prediction. This agrees with our intuition that a 
verified prediction lends support to the theory that made it, in contrast with the limited concept of 
falsifiability advocated by Popper (i.e., that scientific theories can only be tested by proving them wrong). 
So for example, perturbations to the motion of Uranus led the French astronomer U.J.J Leverrier and 
the English scholar J.C. Adams to formulate the prediction, based on Newtonian theory, that a further 
planet ought to exist beyond the orbit of Uranus. The discovery of Neptune in 1846 within 1 degree of the 
predicted position thus should strengthen our belief in the correctness of Newtonian gravity. However, as 
discussed in detail in chapter 5 of Ref [7], the change in the plausibility of Newton's theory following the 
discovery of Uranus crucially depends on the alternative we are considering. If the alternative theory is 
Einstein gravity, then obviously the two theories make the same predictions as far as the orbit of Uranus is 
concerned, hence their relative plausibility is unchanged by the discovery. The alternative "Newton theory 
is false" is not useful in Bayesian model comparison, and we are forced to put on the table a more specific 
model than that before we can assess how much the new observation changes our relative degree of belief 
between an alternative theory and Newtonian gravity. 

In the context of model comparison it is appropriate to think of a model as a specification of a set of 
parameters 9 and of their prior distribution, p{6\M). As shown below, it is the number of free parameters 
and their prior range that control the strength of the Occam's razor effect in Bayesian model comparison: 
models that have many parameters that can take on a wide range of values but that are not needed in 
the light of the data are penalized for their unwarranted complexity. Therefore, the prior choice ought 
to reflect the available parameter space under the model M, independently of experimental constraints we 
might already be aware of. This is because we are trying to assess the economy (or simplicity) of the 
model itself, and hence the prior should be based on theoretical or physical constraints on the model under 
consideration. Often these will take the form of a range of values that are deemed "intuitively" plausible, 
or "natural" . Thus the prior specification is inherent in the model comparison approach. 



x In its formulation by the medieval English philosopher and Franciscan monk William of Ockham (ca. 1285-1349): "Pluralitas non est 
ponenda sine neccesitate" . 
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The prime tool for model selection is the Bayesian evidence, discussed in the next three sections. A 
quantitative measure of the effective model complexity is introduced in section 14.61 We then present some 
popular approximations to the full Bayesian evidence that attempt to avoid the difficulty of priors choice, 
the information criteria, and discuss the limits of their applicability in section 14.71 

For reviews on model selection see e.g. [3, 9, 38] and [39] for cosmological applications. Good starting 
points on Bayes factors are [40,41]. A discussion of the spirit of model selection can be found in the first 
part of [42]. 



4.2 The Bayesian evidence 

The evaluation of a model's performance in the light of the data is based on the Bayesian evidence, which 
in the statistical literature is often called marginal likelihood or model likelihood. Here we follow the practice 
of the cosmology and astrophysics community and will use the term "evidence" instead. The evidence is 
the normalization integral on the right -hand-side of Bayes' theorem, Eq. (jGj) , which we rewrite here for a 
continuous parameter space Q,m and conditioning explicitly on the model under consideration, A4: 

p(d\M)= p{d\6,M)p(d\M)d6 (Bayesian evidence). (17) 

Thus the Bayesian evidence is the average of the likelihood under the prior for a specific model choice. 
From the evidence, the model posterior probability given the data is obtained by using Bayes' Theorem 
to invert the order of conditioning: 

p(M\d) ozp(M)p(d\M), (18) 

where we have dropped an irrelevant normalization constant that depends only on the data and p(M) is 
the prior probability assigned to the model itself. Usually this is taken to be non-committal and equal to 
1/N m if one considers N m different models. When comparing two models, M.q versus A4\, one is interested 
in the ratio of the posterior probabilities, or posterior odds, given by 

p(M \d) _ p(Mq) 

p(M!\d) " Q1 p(M x ) Uyj 
and the Bayes factor Bq\ is the ratio of the models' evidences: 

B i = (Bayes factor). (20) 

p{d\Mi) 

A value Bqi > (<) 1 represents an increase (decrease) of the support in favour of model versus model 
1 given the observed data. From Eq. (|19|) it follows that the Bayes factor gives the factor by which the 
relative odds between the two models have changed after the arrival of the data, regardless of what we 
thought of the relative plausibility of the models before the data, given by the ratio of the prior models' 
probabilities. Therefore the relevant quantity to update our state of belief in two competing models is the 
Bayes factor. 

To gain some intuition about how the Bayes factor works, consider two competing models: .Mo predicting 
that a quantity 8 = with no free parameters, and A4\ which assigns 9 a Gaussian prior distribution with 
mean and variance T? . Assume we perform a measurement of 6 described by a normal likelihood of 
standard deviation a, and with the maximum likelihood value lying A standard deviations away from 0, 
i.e. |0 m ax/°i = A. Then the Bayes factor between the two models is given by, from Eq. f)20f) 

A 2 



B m = ,/! + (,/£)-> =p ( - 2(1 + ws)2) l • PU 
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Table 1. Empirical scale for evaluating the strength of evidence 
when comparing two models, A4q versus A4i (so— called "Jeffreys' 
scale"). Threshold values are empirically set, and they occur for 
values of the logarithm of the Bayes factor of |lni?oi| — 1.0, 
2.5 and 5.0. The right— most column gives our convention for de- 
noting the different levels of evidence above these thresholds. 
The probability column refers to the posterior probability of the 
favoured model, assuming non-committal priors on the two com- 
peting models, i.e. p(A4o) — p(-Mi) — 1/2 and that the two mod- 
els exhaust the model space, p{Ado\d) + p{M.i\d) — 1. 



| In Boil 


Odds 


Probability 


Strength of evidence 


< 1.0 


< 3 : 1 


< 0.750 


Inconclusive 


1.0 


~ 3 : 1 


0.750 


Weak evidence 


2.5 


- 12 : 1 


0.923 


Moderate evidence 


5.0 


~ 150 : 1 


0.993 


Strong evidence 



For A > 1, corresponding to a detection of the new parameter at many sigma, the exponential term 
dominates and Bqi <C 1, favouring the more complex model with a non-zero extra parameter, in agreement 
with the usual conclusion. But if A < 1 and cj/S <C 1 (i.e., the likelihood is much more sharply peaked 
than the prior and in the vicinity of 0), then the prediction of the simpler model that 8 = has been 
confirmed. This leads to the Bayes factor being dominated by the Occam's razor term, and Bq\ ~ E/cr, i.e. 
evidence accumulates in favour of the simpler model proportionally to the volume of "wasted" parameter 
space. If however a/T, S> 1 then the likelihood is less informative than the prior and Bqi — > 1, i.e. the data 
have not changed our relative belief in the two models. 

Bayes factors are usually interpreted against the Jeffreys' scale [3] for the strength of evidence, given in 
Table [TJ This is an empirically calibrated scale, with thresholds at values of the odds of about 3 : 1, 12 : 1 
and 150 : 1, representing weak, moderate and strong evidence, respectively. A useful way of thinking of 
the Jeffreys' scale is in terms of betting odds — many of us would feel that odds of 150 : 1 are a fairly 
strong disincentive towards betting a large sum of money on the outcome. Also notice from Table [1] that 
the relevant quantity in the scale is the logarithm of the Bayes factor, which tells us that evidence only 
accumulates slowly and that indeed moving up a level in the evidence strength scale requires about an 
order of magnitude more support than the level before. 

Bayesian model comparison does not replace the parameter inference step (which is performed within 
each of the models separately). Instead, model comparison extends the assessment of hypotheses in the light 
of the available data to the space of theoretical models, as evident from Eq. (|19p . which is the equivalent 
expression for models to Eq. (I12p . representing inference about the parameters value within each model 
(for multi-model inference, merging the two levels, see section [6T2]) . 

4.3 Computation and interpretation of the evidence 

The computation of the Bayesian evidence (|17p is in general a numerically challenging task, as it involves a 
multi-dimensional integration over the whole of parameter space. An added difficulty is that the likelihood 
is often sharply peaked within the prior range, but possibly with long tails that do contribute significantly 
to the integral and which cannot be neglected. Other problematic situations arise when the likelihood is 
multi-modal, or when it has strong degeneracies that confine the posterior to thin sheets in parameter 
space. Until recently, the application of Bayesian model comparison has been hampered by the difficulty 
of reliably estimating the evidence. Fortunately, several methods are now available, each with its own 
strengths and domains of applicability. 

(i) The numerical method of choice until recently has been thermodynamic integration, also called simu- 
lated annealing (see e.g. [11,43,44] and references therein for details). Its computational cost can become 
fairly large, as it depends heavily on the dimensionality of the parameter space and on the characteristic 
of the likelihood function. In typical cosmological applications [45-47], thermodynamic integration can 
require up to 10 7 likelihood evaluations, two orders of magnitude more than MCMC-based parameter 
estimation. 
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(ii) Skilling [48,49] has put forward an elegant algorithm called "nested sampling", which has been imple- 
mented in the cosmological context by [50-54] (for a theoretical discussion of the algorithmic properties, 
see [55]). The gist of nested sampling is that the multi-dimensional evidence integral is recast into a 
one-dimensional integral that is easy to evaluate numerically. This technique allows to reduce the com- 
putational burden to about 10 5 likelihood evaluations^. Recently, the development of what is called 
"multi-modal nested sampling" has allowed to increase significantly the efficiency of the method [53] , 
reducing the number of likelihood evaluations by another order of magnitude. 

(iii) Approximations to the Bayes factor, Eq. (|20p . are available for situations in which the models being 
compared are nested into each other, i.e. the more complex model (Mi) reduces to the original model 
(Mo) for specific values of the new parameters. This is a fairly common scenario in cosmology, where 
one wishes to evaluate whether the inclusion of the new parameters is supported by the data. For 
example, we might want to assess whether we need isocurvature contributions to the initial conditions 
for cosmological perturbations, or whether a curvature term in Einstein's equation is needed, or whether 
a non-scale invariant distribution of the primordial fluctuation is preferred (see Table H] for actual 
results). Writing for the extended model parameters 9 = ((f), ip), where the simpler model .Mo is 
obtained by setting ip = 0, and assuming further that the prior is separable (which is usually the case 
in cosmology), i.e. that 



p(<PMMi) = p(i>\Mi) P (<t>\M ) 



(22) 



the Bayes factor can be written in all generality as 



01 



P (i>\d,Mi) 



p(ifj\M] 



(23) 



•4>=o 



IV 



This expression is known as the Savage-Dickey density ratio (SDDR, see [56] and references therein). 
The numerator is simply the marginal posterior under the more complex model evaluated at the 
simpler model's parameter value, while the denominator is the prior density of the more complex 
model evaluated at the same point. This technique is particularly useful when testing for one extra 
parameter at the time, because then the marginal posterior p(ip\d, Mi) is a 1-dimensional function 
and normalizing it to unity probability content only requires a 1-dimensional integral, which is simple 
to do using for example the trapezoidal rule. 

An instructive approximation to the Bayesian evidence can be obtained when the likelihood function 
is unimodal and approximately Gaussian in the parameters. Expanding the likelihood around its peak 
to second order one obtains the Laplace approximation 



p(d\e,M) ~ £ max exp 



"max me - 



(24) 



where # max is the maximum-likelihood point, £ max the maximum likelihood value and L the likelihood 
Fisher matrix (which is the inverse of the covariance matrix for the parameters). Assuming as a prior 
a multinormal Gaussian distribution with zero mean and Fisher information matrix P one obtains for 
the evidence, Eq. (PTT 



p(d\M) = C 



\F\-V 2 

luax |p|-l/2 



exp 



9 F9) 



(25) 



where the posterior Fisher matrix is F = L + P and the posterior mean is given by 9 = F 



x L9 n 



1 Publicly available modules implementing nested sampling can be found at cosmonest.org [51] and 
http://www.mrao.cam.ac.uk/software/cosmoclust/ [52] (accessed Feb 2008). 
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Figure 3. Illustration of Bayesian model comparison for two nested models, where the more complex model has one extra parameter. 
The outcome of the model comparison depends both on the information content of the data with respect to the a priori available 
parameter space, Tio (horizontal axis) and on the quality of fit (vertical axis, A, which gives the number of sigma significance of the 
measurement for the extra parameter). The contours are computed from Eq. H23I I. assuming a Gaussian likelihood and prior (adapted 

from [58]). 

From Eq. (|25p we can deduce a few qualitatively relevant properties of the evidence. First, the quality of 
fit of the model is expressed by £ m ax; the best-fit likelihood. Thus a model which fits the data better will 
be favoured by this term. The term involving the determinants of P and F is a volume factor, encoding the 
Occam's razor effect. As |P| < \F\, it penalizes models with a large volume of wasted parameter space, i.e. 
those for which the parameter space volume li 7 ! -1 / 2 which survives after arrival of the data is much smaller 
than the initially available parameter space under the model prior, |P| -1 / 2 . Finally, the exponential term 
suppresses the likelihood of models for which the parameters values which maximise the likelihood, $max? 
differ appreciably from the expectation value under the posterior, 6. Therefore when we consider a model 
with an increased number of parameters we see that its evidence will be larger only if the quality-of-fit 
increases enough to offset the penalizing effect of the Occam's factor (see also the discussion in [57]). 

On the other hand, it is important to notice that the Bayesian evidence does not penalizes models 
with parameters that are unconstrained by the data. It is easy to see that unmeasured parameters (i.e., 
parameters whose posterior is equal to the prior) do not contribute to the evidence integral, and hence 
model comparison does not act against them, awaiting better data. 

4.4 The rough guide to model comparison 

The gist of Bayesian model comparison can be summarized by the following, back-of-the-envelope Bayes 
factor computation for nested models. The result is surprisingly close to what one would obtain from the 
more elaborate, fully-fledged evidence evaluation, and can serve as a rough guide for the Bayes factor 
determination. 

Returning to the example of Eq. (|2ip . if the data are informative with respect to the prior on the extra 
parameter (i.e., for a/T, <C 1) the logarithm of the Bayes factor is given approximately by 



where as before A gives the number of sigma away from a null result (the "significance" of the measurement). 
The first term on the right-hand-side is approximately the logarithm of the ratio of the prior to posterior 



In £ i ~ ln(£/<r) - A 2 /2 



(26) 
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volume. We can interpret it as the information content of the data, as it gives the factor by which the 
parameter space has been reduced in going from the prior to the posterior. This term is positive for 
informative data, i.e. if the likelihood is more sharply peaked than the prior. The second term is always 
negative, and it favours the more complex model if the measurement gives a result many sigma away from 
the prediction of the simpler model (i.e., for A 3> 0). We are free to measure the information content in base- 
10 logarithm (as this quantity is closer to our intuition, being the order of magnitude of our information 
increase), and we define the quantity I\q = log 10 (E/a). Figure [3] shows contours of |lni?oi| = const for 
const = 1.0,2.5,5.0 in the (Jio,A) plane, as computed from Eq. ([26]) . The contours delimit significative 
levels for the strength of evidence, according to the Jeffreys' scale (Table [1]). For moderately informative 
data (Jio ~ 1 — 2) the measured mean has to lie at least about 4<r away from in order to robustly disfavor 
the simpler model (i.e., A > 4). Conversely, for A < 3 highly informative data (Iio > 2) do favor the 
conclusion that the extra parameter is indeed 0. In general, a large information content favors the simpler 
model, because Occam's razor penalizes the large volume of "wasted" parameter space of the extended 
model. 

An useful properties of Figure [3] is that the impact of a change of prior can be easily quantified. A 
different choice of prior width (i.e., E) amounts to a horizontal shift across Figure El at least as long as 
Iio > (i.e., the posterior is dominated by the likelihood). Picking more restrictive priors (reflecting more 
predictive theoretical models) corresponds to shifting the result of the model comparison to the left of 
Figure O returning an inconclusive result (white region) or a prior-dominated outcome (hatched region). 
Notice that results in the 2-3 sigma range, which are fairly typical in cosmology, can only support the 
more complex model in a very mild way at best (odds of 3 : 1 at best), while actually being most of the 
time either inconclusive or in favour of the simpler hypothesis (pink shaded region in the bottom right 
corner) . 

Bayesian model comparison is usually conservative when it comes to admitting a new quantity in our 
model, even in the case when the prior width is chosen incorrectly. Consider the following two possibilities: 

• If the prior range is too small, the model comparison result will be non-committal (white region in 
Figure E])) or even prior dominated (hatched region, where the posterior is dominated by the prior). 
Hence in this case we have to hold judgement until better data come along. 

• Too wide a prior will instead unduly favour the simpler model (pink, shaded regions). However, as 
new, better data come along the result will move to the right (for a fixed prior width, as the likelihood 
becomes narrower) but eventually also upwards, towards a larger number of sigma significance, if the 
true model really has a non-zero extra parameter. Eventually, our initial "poor" prior choice will be 
overridden as the number of sigma becomes large enough to take the result into the blue, shaded region. 

In both cases the result of the model comparison will eventually override the "wrong" prior choice 
(although it might take a long time to do so), exactly as it happens for parameter inference. 



4.5 Getting around the prior - The maximal evidence for a new parameter 

For nested models, Eq. (|23p shows that the relative probability of the more complex model can be made 
arbitrarily small by increasing the broadness of the prior for the extra parameters, p{ip\Mi) (as the prior 
is a pdf, it must integrate to unit probability. Hence a broader prior corresponds to a smaller value of 
p(^|A / Ii)^,=o in the denominator). Often, this is not problematical as prior ranges for the new parameters 
can (and should) be motivated from the underlying theory. For example, in assessing whether the scalar 
spectral index {n s ) of the primordial perturbations differs from the scale-invariant value n s = 1, the prior 
range of the index can be constrained to be 0.8 < n, < 1.2 within the theoretical framework of slow roll 
inflation (more on this in section [6]). The sensitivity of the model comparison result can also be investigated 
for other plausible, physically motivated choices of prior ranges, see e.g. [59,60]. If the model comparison 
outcome is qualitatively the same for a broad choice of plausible priors, then we can be confident that the 
result is robust. 

Although the Bayesian evidence offers a well-defined framework for model comparison, there are cases 
where there is not a specific enough model available to place meaningful limits on the prior ranges of 
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Tabic 2. Translation tabic (using Eq. ]27ft ) between frcqucntist 
significance values (p— values) and the upper bounds on the odds 
(Bio) m favour of the more complex model. No other choice of 
prior (within the family considered in the text) will give higher 
evidence in favour of the extra parameters. The "sigma" column 
is the corresponding number of standard deviations away from the 
mean for a normal distribution. The "category" column gives the 
Jeffreys' scale of Tablc[^(from [65]). 
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new parameters in a model. This hurdle arises frequently in cases when the new parameters are a phe- 
nomenological description of a new effect, only loosely tied to the underlying physics, such as for example 
expansion coefficients of some series. An interesting possibility in such a case is to choose the prior on the 
new parameters in such a way as to maximise the probability of the new model, given the data. If, even 
under this best case scenario, the more complex model is not significantly more probable than the simpler 
model, then one can confidently say that the data does not support the addition of the new parameters, 
without worrying that some other choice of prior will make the new model more probable [61-63]. 

Consider the Bayes factor in favour of the more complex model, -Bio = V-^oij with Bqi given by Eq. (|20p . 
The simpler model, .Mo, is obtained from Mq by setting 9 = 9*. An absolute upper bound to the evidence 
in favour of the more complex model is obtained by choosing p(9\M.\) to be a delta function centered at 
the maximum likelihood value under # m ax- It can be shown that in this case the upper bound -Bio 
corresponds to the likelihood ratio between # max an d 9*. However, it can be argued that such a choice for 
the prior is unjustified, as it can only be made ex post facto after one has seen the data and obtained the 
maximum likelihood estimate. It is more natural to appeal to a mild principle of indifference as to the 
value of the parameter under the more complex model, and thus to maximize the evidence over priors that 
are symmetric about 9* and unimodal. This can be shown to be equivalent to maximizing over all priors 
that are uniform and symmetric about 9* . This procedure leads to a very simple expression for the lower 
bound on the Bayes factor [61] 

B w < B w = (27) 
eplnp 

for p < e _1 , where e is the exponential of one. Here, p is the p-value, the probability that the the value 
of some test statistics be as large as or larger than the observed value assuming the null hypothesis (i.e., 
the simpler model) is true (see [61,62] for a detailed discussion). A more precise definition of p-values is 
given in any standard statistical textbook, e.g. [64]. 

Eq. (|27p offers a useful calibration of frequentist significance values (p-values) in terms of upper bounds 
on the Bayesian evidence in favour of the extra parameters. The advantage is that the quantity on the 
left-hand side of Eq. (|27p can be straightforwardly interpreted as an upper bound on the odds for the more 
complex model, whereas the p-values cannot. This point is illustrated very clearly with an astronomical 
example in [66]. In fact, a word of caution is in place regarding the meaning of the p-value, which is 
often misinterpreted as an error probability, i.e. as giving the fraction of wrongly rejected nulls in the 
long run. For example, when a frequentist test rejects the null hypothesis (in our example, that 9 = 9*) 
at the 5% level, this does not mean that one will make a mistake roughly 5% of the time if one were to 
repeat the test many times. The actual error probability is much larger than that, and can be shown to 
be at least 29% (for unimodal, symmetric priors, see Table 6 in [62]). This important conceptual point is 
discussed in in greater detail in [62,63,67]. The fundamental reason for this discrepancy with intuition is 
that frequentist significance tests give the probability of observing data as extreme or more extreme than 
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what has actually been measured, assuming the null hypothesis TCq to be true (which in Bayesian terms 
amounts to the choice of a model, Mo). But the quantity one is interested in is actually the probability 
of the model A4q given the observations, which can only be obtained by using Bayes' Theorem to invert 
the order of conditioning^. Indeed, in frequentist statistics, a hypothesis is either true or false (although 
we do not know which case it is) and it is meaningless to attach to it a probability statement. 

Table [2] lists B\q for some common thresholds for significance values and the strength of evidence scale, 
thus giving a conversion table between significance values and upper bounds on the Bayesian evidence, 
independent of the choice of prior for the extra parameter (within the class of unimodal and symmetric 
priors). It is apparent that in general the upper bound on the Bayesian evidence is much more conservative 
than the p-value, e.g. a 99% result (corresponding to p = 0.01) corresponds to odds of 8 : 1 at best in 
favour of the extra parameters, which fall short of even the "moderate evidence" threshold. Strong evidence 
at best requires at least a 3.6 sigma result. A useful rule of thumb is thus to think of a s sigma result 
as a s — 1 sigma result, e.g. a 99.7% result (3 sigma) really corresponds to odds of 21 : 1, i.e. about 95% 
probability for the more complex model. Thus when considering the detection of a new parameter, instead 
of reporting frequentist significance values it is more appropriate to present the upper bound on the Bayes 
factor, as this represents the maximum probability that the extra parameter is different from its value 
under the simpler model. 

This approach has been applied to the cosmological context in [65], who analysed the evidence in favour 
of a non-scale invariant spectral index and of asymmetry in the cosmic microwave background maps. For 
further details on the comparison between frequentist hypothesis testing and Bayesian model selection, 
see [28]. 

4.6 The effective number of parameters - Bayesian model complexity 

The usefulness of a Bayesian model selection approach based on the Bayesian evidence is that it tells us 
whether the increased "complexity" of a model with more parameters is justified by the data. However, 
it is desirable to have a more refined definition of "model complexity" , as the number of free parameters 
is not always an adequate description of this concept. For example, if we are trying to measure a periodic 
signal in a time series, we might have a model of the data that looks like 



where A, 9, 5 are free parameters we wish to constrain. But if 9 is very small compared to 1 and the noise 
is large compared to 9, then the oscillatory term remains unconstrained by the data and effectively we can 
only measure the normalization A. Thus the parameters 9, 5 should not count as free parameters as they 
cannot be constrained given the data we have, and the effective model complexity is closer to 1 than to 3. 
From this example it follows that the very notion of "free parameter" is not absolute, but it depends on 
both what our expectations are under the model, i.e. on the prior, and on the constraining power of the 
data at hand. 

In order to define a more appropriate measure of complexity, in [68] the notion of Bayesian complexity was 
introduced, which measures the number of parameters that the data can support. Consider the information 
gain obtained when upgrading the prior to the posterior, as measured by the the Kullback-Leibler (KL) 
divergence [69] between the posterior, p and the prior, denoted here by it: 



In virtue of Bayes' theorem, p(9\d,M) = C{9)t:(9\M.) /p{d\M) hence the KL divergence becomes the sum 



lr Tb convince oneself of the difference between the two quantities, consider the following example [22], Imagine selecting a person at 
random — the person can either be male or female (our hypothesis). If the person is female, her probability of being pregnant (our data) 
is about 3%, i.e. p(pregnant| female) = 0.03. However, if the person is pregnant, her probability of being female is much larger than that, 
i.e. (female | pregnant) 2> 0.03. The two conditional probabilities are related by Bayes' theorem. 



f(t) = A{\ + 9 cos{t + 5)) 



(28) 




(29) 
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of the negative log evidence and the expectation value of the log-likelihood under the posterior: 



D Kh (p,ir) =-]np(d\M) + J p(6\d,M) In £(0)d0. 



(30) 



To gain a feeling for what the KL divergence expresses, let us compute it for a 1-dimensional case, with a 
Gaussian prior around of variance E and a Gaussian likelihood centered on $niax and variance c . We 
obtain after a short calculation 



D KL (p,ir) 



i ° 1 

In h - 

S 2 



(31) 



The second term on the right-hand side gives the reduction in parameter space volume in going from the 
prior to the posterior. For informative data, cr/E <C 1, this terms is positive and grows as the logarithm 
of the volume ratio. On the other hand, in the same regime the third term is small unless the maximum 
likelihood estimate is many standard deviations away from what we expected under the prior, i.e. for 
^max/c ^ 1- This means that the maximum likelihood value is "surprising", in that it is far from what 
our prior led us to expect. Therefore we can see that the KL divergence is a summary of the amount of 
information, or "surprise", contained in the data. 
Let us now define an effective x 2 through the likelihood as C(6) = exp(— x 2 /2). Then Eq. (|30p gives 



D kl (p,tt) = -^(9) +\np(d\M), 



(32) 



where the bar indicates a mean taken over the posterior distribution. The posterior average of the effective 
chi-square is a quantity can be easily obtained by Markov chain Monte Carlo techniques (see section [372]) . 
We then subtract from the "expected surprise" the estimated surprise in the data after we have actually 
fitted the model parameters, denoted by 



D Kh = -^ X 2 0) + ^p(d\M), 



(33) 



where the first term on the right-hand-side is the effective chi-square at the estimated value of the 
parameters, indicated by a hat. This will usually be the posterior mean of the parameters, but other 
possible estimators are the maximum likelihood point or the posterior median, depending on the details 
of the problem. We then define the quantity 



C b = -2[ £>kl(p, 7r) - D K l ) =X 2 (9) -X (#) (Bayesian complexity), 



(34) 



(notice that the evidence term is the same in Eqs. (|32p and (|33|) as it does not depend on the parameters 
and therefore it disappears from the complexity). The Bayesian complexity gives the effective number of 
parameters as a measure of the constraining power of the data as compared to the predictivity of the model, 
i.e. the prior. Hence C depends both on the data and on the prior available parameter space. This can be 
understood by considering further the toy example of a Gaussian likelihood of variance a 2 around # m ax and 
a Gaussian prior around of variance E 2 . Then a short calculation shows that the Bayesian complexity is 
given by (see [70] for details) 



1 



l + (a/E)2' 



(35) 



So for cr/E <C 1, Cb ~ 1 and the model has one effective, well constrained parameter. But if the likelihood 
width is large compared to the prior, <r/E S> 1, then the experiment is not informative with respect to our 
prior beliefs and Ci — > 0. 
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The Bayesian complexity can be a useful diagnostic tool in the tricky situation where the evidence 
for two competing models is about the same. Since the evidence does not penalize parameters that are 
unmeasured, from the evidence alone we cannot know if we are in the situation where the extra parameters 
are simply unconstrained and hence irrelevant (9 <C 1 in the example of Eq. (|28p ) or if they improve the 
quality-of-fit just enough to offset the Occam's razor penalty term, hence giving the same evidence as the 
simpler model. The Bayesian complexity breaks this degeneracy in the evidence allowing to distinguish 
between the two cases: 

(i) p(d\A4o) ~ p(d\M.\) and Cb{M.\) > Cb(Mo): the quality of the data is sufficient to measure the 
additional parameters of the more complicated model (.Mi), but they do not improve its evidence by 
much. We should prefer model A4o, with less parameters. 

(ii) p{d\M.o) ~ p(d\M.{) and Cb(M-i) ~ Cb(Mo): both models have a comparable evidence and the effective 
number of parameters is about the same. In this case the data is not good enough to measure the 
additional parameters of the more complicated model (given the choice of prior) and we cannot draw 
any conclusions as to whether the extra parameter is needed. 

The first application of this technique in the cosmological context is in [70], where it is applied to the 
number of effective parameters from cosmic microwave background data, while in [71] it was used to 
determine the number of effective dark energy parameters. 



4.7 Information criteria for approximate model comparison 

Sometimes it might be useful to employ methods that aim at an approximate model selection under some 
simplifying assumptions that give a default penalty term for more complex models, which replaces the 
Occam's razor term coming from the different prior volumes in the Bayesian evidence. While this is an 
obviously appealing feature, on closer examination it has the drawback of being meaningful only in fairly 
specific cases, which are not always met in astrophysical or cosmological applications. In particular, it can 
be argued that the Bayesian evidence (ideally coupled with an analysis of the Bayesian complexity) ought 
to be preferred for model building since it is precisely the lack of predictivity of more complicated models, 
as embodied in the physically motivated range of the prior, that ought to penalize them. 

With this caveat in mind, we list below three types of information criteria that have been widely used 
in several astrophysical and cosmological contexts. An introduction to the information criteria geared 
for astrophysicists is given by Ref. [72]. A discussion of the differences between the different information 
criteria as applied to astrophysics can be found in [73] , which also presents a few other information criteria 
not discussed here. 

• Akaike Information Criterion (AIC): Introduced by Akaike [74], the AIC is an essentially frequentist 
criterion that sets the penalty term equal to twice the number of free parameters in the model, k: 

AIC = —2 In £ max + 2A; (36) 

where £ max = p(d\0 mSuX , A4) is the maximum likelihood value. The derivation of the AIC follows from an 
approximate minimization of the KL divergence between the true model distribution and the distribution 
being fitted to the data. 

• Bayesian Information Criterion (BIC): Sometimes called "Schwarz Information Criterion" (from 
the name of its proposer [75] ) , the BIC follows from a Gaussian approximation to the Bayesian evidence 
in the limit of large sample size: 

BIC = —2 In £ max + k\nN (37) 

where k is the number of fitted parameters as before and N is the number of data points. The best 
model is again the one that minimizes the BIC. 
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• Deviance Information Criterion (DIC): Introduced by [68], the DIC can be written as 

DIC = -2D^l + 2C b . (38) 

In this form, the DIC is reminiscent of the AIC, with the ln£ max term replaced by the estimated KL 
divergence and the number of free parameters by the effective number of parameters, C , from Eq. (|34|) . 
Indeed, in the limit of well-constrained parameters, the AIC is recovered from (|38|) . but the DIC has 
the advantage of accounting for unconstrained directions in parameters space. 

The information criteria ought to be interpreted with care when applied to real situations. Comparison 
of Eq. (I37D with Eq. (136[) shows that for N > 7 the BIC penalizes models with more free parameters more 
harshly than the AIC. Furthermore, both criteria penalize extra parameters regardless of whether they 
are constrained by the data or not, unlike the Bayesian evidence. This comes about because implicitly 
both criteria assume a "data dominated" regime, where all free parameters are well constrained. But in 
general the number of free parameters might not be a good representation of the actual number of effective 
parameters, as discussed in section 14.61 In a Bayesian sense it therefore appears desirable to replace the 
number of parameters k by the effective number of parameters as measured by the Bayesian complexity, 
as in the DIC. 

It is instructive to inspect briefly the derivation of the BIC. The unnormalized posterior g(9) = 
C(9)p(9\M) can be approximated by a multi-variate Gaussian around its mode 9, i.e. g(9) « g(9) — 
1/2(0 — 9) l F(9 — 9), where F is minus the Hessian of the posterior evaluated at the posterior mode. Then 
the evidence integral can be computed analytically, giving 

p(d\M) w exp( 5 (0))(2vr) fc / 2 |Fr 1 / 2 . (39) 

For large samples, N — > oo, the posterior mode tends to the maximum likelihood point, 9 — > # max , hence 
g{9) — > £ m axP(#max) and the log-evidence becomes 

lnp(d[7W)^ln/: max +lnp(0 max ) + ^ln(27r)-iln|F| (N -> oo), (40) 

where the error introduced by the various approximations scales to leading order as 0(N^ 1 ^ 2 ). On the 
right-hand-side, the first term scales as O(N), the second and third terms are constant in N, while the 
last term is given by In \F\ ~ A; In N, since the variance scales as the number of data points, N. Dropping 
terms of order 0(1) or below^, we obtain 

lnp(d\M) ^lnC max -^lnN (N — » oo). (41) 

Thus maximising the evidence of Eq. (|4ip is equivalent to minimizing the BIC in Eq. (|37p . We see that 
dropping the term lnp(# max ) in (|40p means that effectively we expect to be in a regime where the model 
comparison is dominated by the likelihood, and that the prior Occam's razor effect becomes negligible. This 
is often not the case in cosmology. Furthermore, this "weak" prior choice is intrinsic (even though hidden 
from the user) in the form of the BIC, and often it is not justified. In conclusion, it appears that what 
makes the information criteria attractive, namely the absence of an explicit prior specification, represents 
also their intrinsic limitation. 



See [76] for a more careful treatment, where a better approximation is obtained by assuming a weak prior which contains the same 
information as a single datum. 
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5 Cosmological parameter inference 

Driven by the emergence of inexpensive sensors and computing capabilities, the amount of cosmological 
data has been increasing exponentially over the last 15 years or so. For example, the first map of cosmic 
microwave background (CMB) anisotropics obtained in 1992 by COBE [77] contained ~ 10 3 pixels, which 
became ~ 5 x 10 4 by 2002 with CBI [78,79]. Current state-of-the art maps (from the WMAP satellite [80]) 
involve ~ 10 6 pixels, which are set to grow to ~ 10 7 with Planck in the next couple of years. Similarly, 
angular galaxy surveys contained ~ 10 6 objects in the 1970's, while by 2005 the Sloan Digital Sky Sur- 
vey [81] had measured ~ 2x 10 8 objects, which will increase to ~ 3 x 10 9 by 2012 when the Large Synoptic 
Survey Telescope [82] comes onlinej. 

This data explosion drove the adoption of more efficient map making tools, faster component separation 
algorithms and parameter inference methods that would scale more favourably with the number of dimen- 
sions of the problem. As data sets have become larger and more precise, so has grown the complexity of the 
models being used to describe them. For example, if only 2 parameters could meaningfully be extracted 
from the COBE measurement of the large-scale CMB temperature power spectrum (namely the normal- 
ization and the spectral tilt [77,83]), the number of model parameters had grown to 11 by 2002, when 
smaller-scale measurements of the acoustic peaks had become available. Nowadays, parameter spaces of 
up to 20 dimensions are routinely considered. 

This section gives an introduction to the broad problem of cosmological parameter inference and high- 
lights some of the tools that have been introduced to tackle it, with particular emphasis on innovative 
techniques. This is a vast field and any summary is bound to be only sketchy. We give throughout references 
to selected papers covering both historically important milestones and recent major developments. 

5.1 The "vanilla" ACDM cosmological model 

Before discussing the quantities we are interested in measuring in cosmology (the "cosmological param- 
eters") and some of the observational probes available to do so, we briefly sketch the general framework 
which goes under the name of "cosmological concordance model" . Because it is a relatively simple scenario 
containing both a cosmological constant (A) and cold dark matter (CDM) (more about them below), it is 
also known as the "vanilla" ACDM model. 

Our current cosmological picture is based on the scenario of an expanding Universe, as implied by the 
observed redshift of the spectra of distant galaxies (Hubble's law). This in turn means that the Universe 
began from a hot and dense state, the initial singularity of the Big Bang. The existence of the cosmic 
microwave background lends strong support to this idea, as it is interpreted as the relic radiation from the 
hotter and denser primordial era. The expanding spacetime is described by Einstein's general relativity. The 
cosmological principle states that the Universe is isotropic (i.e., the same in all directions) and homogeneous 
(the same everywhere). If follows that an isotropically expanding Universe is described by the so-called 
Friedmann-Robertson- Walker metric, 



where k defines the geometry of spatial sections (if n is positive, the geometry is spherical; if it is zero, the 
geometry is flat; if it is negative, the geometry is hyperbolic). The scale factor a(t) describes the expansion 
of the Universe, and it is related to redshift z by 




(42) 



l + z 




(43) 



o(t) 



2 Alex Szalay, talk at the specialist discussion meeting "Statistical challenges in cosmology and astroparticle physics", held at the Royal 
Astronomical Society, London, Oct 2007. 
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where a(to) is the scale factor today and a{t) the scale factor at redshift z. The relation between redshift 
and comoving distance r is obtained from the above metric via the Friedmann equation, and is given by 

a dr = ±- + z? + + («6 + ncdm)(l + zf + (Jly + + z) 4 ] " 1/2 dz, (44) 

-no 

where i?o is the Hubble constant, and fi x are the cosmological parameters describing the matter-energy 
content of the Universe. Standard parameters included in the vanilla model are neutrinos (£l v , with a mass 
< 1 eV), photons (0 7 ), baryons cold dark matter (fJ c dm) and a cosmological constant (Oa)- The 

curvature term Q K is included for completeness but is currently not required by the standard cosmological 
model (see section [6]). The comoving distance determines the apparent brightness of objects, their apparent 
size on the sky and the number density of objects in a comoving volume. Hence measurements of the 
brightness of standard candles, of the length of standard rulers or of the number density of objects at a 
given redshift leads to the determination of the cosmological parameters in Eq. (|44p (see next section) . 

The currently accepted paradigm for the generation of density fluctuations in the early Universe is 
inflation. The idea is that quantum fluctuations in the primordial era were stretched to cosmological scales 
by an initial period of exponential expansion, called "inflation" , possibly driven by a yet unknown scalar 
field. This increased the scale factor by about 26 orders of magnitude within about 10 _32 s after the Big 
Bang. Although presently we have only indirect evidence for inflation, it is commonly accepted that such a 
short burst of exponential growth in the scale factor is required to solve the horizon problem, i.e. to explain 
why the CMB is so highly homogeneous across the whole sky. The quantum fluctuations also originated 
temperature anisotropies in the CMB, whose study has proved to be one of the powerhouses of precision 
cosmology. From the initial state with small perturbations imprinted on a broadly uniform background, 
gravitational attraction generated the complex structures we see in the modern Universe, as indicated 
both by observational evidence and highly sophisticated computer modelling. 

Of course it is possible to consider completely different models, based for example on alternative theories 
of gravity (such as Bekenstein's theory [84,85] or Jordan-Brans-Dicke theory [86]), or on a different way 
of comparing model predictions with observations [87-89]. Discriminating among models and determining 
which model is in best agreement with the data is a task for model comparison techniques, whose applica- 
tion to cosmology is discussed section [6j Here we will take the vanilla ACDM model as our starting point 
for the following considerations on cosmological parameters and how they are measured. 

5.2 Cosmological observations 

The discovery of temperature fluctuations in the Cosmic Microwave Background (CMB) in 1992 by the 
COBE satellite [77] marked the beginning of the era of precision cosmology. Many other observations have 
contributed to the impressive development of the field in less than 20 years. For example, around 1990 
the picture of flat Universe with both cold dark matter and a positive cosmological constant was only 
beginning to emerge, and only thanks to the painstakingly difficult work of gluing together several fairly 
indirect clues [90]. At the time of writing (January 2008), the total density is known with an error of 
order 1%, and it is likely that this uncertainty will be reduced by another two orders of magnitude in 
the mid-term [91]. The high accuracy of modern precision cosmology rests on the combination of several 
different probes, that constrain the physical properties of the Universe at many different redshifts. 

(i) Cosmic microwave background (CMB): CMB anisotropies offer a snapshot of the Universe at the 
time of recombination, about 380,000 years after the Big Bang, at a redshift z ~ 1100. As described 
above, the temperature differences measured in the CMB arise from quantum fluctuations during the 
inflationary phase. Their usefulness lies in the fact that they are small (AT/T ~ 10~ 5 ) and hence linear 
perturbation theory is mostly sufficient to predict very accurately their statistical distribution. The 2- 
point correlation function of the anisotropies is usually described via its Fourier transform, the angular 
power spectrum, which presents a series of characteristic peaks called acoustic oscillations, see e.g. [92, 
93]. Their structure depends in a rich way on the cosmological parameters introduced in Eq. (|44p . as 
well as on the initial conditions for the perturbations emerging from the inflationary era (see e.g. [94-96] 
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Figure 4. State— of— the-art cosmic microwave background temperature power spectrum measurements along with the best-fit 
ACDM model (solid line), showing data from WMAP 3-yr [80], the Boomerang 2003 flight [101] and ACBAR [97] (from [97]). 

for further details). The anisotropies are polarized at the level of 1%, and measuring accurately the 
information encoded by the weak polarization signal is the goal of many ongoing observations. State- 
of-the art measurements are described in e.g. [80,97-101]. An example of recent measurements of the 
temperature power spectrum is shown in Figure HI Later this year, the Planck satellite is expected to 
start full-sky, high-resolution observations of both temperature and polarization. 

(ii) Large scale structures (LSS): the correlation function among galaxies gives an estimate of the 
correlation properties of the underlying dark matter distribution, up to a bias factor relating the dark 
matter to the baryon distribution. Current data typically extend out to z ~ 0.7. The resulting power 
spectrum (recent data are shown in Figure Ej) depends mainly on the ratio of the radiation to matter 
energy density, on the initial distribution of the perturbations with scale (spectral index) and on the 
overall normalization (which can be extracted once a bias model is specified). Heavy numerical simula- 
tion is nowadays used to model accurately small scales, where non-linear effects become dominant. The 
tool of choice to measure the power spectrum on small scales is becoming the observations of absorp- 
tion lines from neutral hydrogen clouds, the so-called Lyman-a forest [102], although concerns remain 
about the reliability of the theoretical modelling of non-linear effects. Recently, both the Sloan Digital 
Sky Survey [103] and the 2dF Galaxy Redshift Survey [104] have detected the presence of baryonic 
acoustic oscillations, which appear as a bump in the galaxy-galaxy correlation function corresponding 
to the scale of the acoustic oscillations in the CMB. The physical meaning is that galaxies tend to form 
preferentially at a separation corresponding to the characteristic scale of inhomogeneities in the CMB. 
Baryonic oscillations can be used as rulers of known length (measured via the CMB acoustic peaks) 
located at a much smaller redshift than the CMB (currently, z ~ 0.3), and hence they are powerful 
probes of the recent expansion history of the Universe, with particular focus on dark energy properties. 
The distribution of clusters with redshift can also be employed to probe the growth of perturbations 
and hence to constrain cosmology. Current galaxy redshift surveys have catalogued about half a million 
objects, but a new generation of surveys aims at taking this number to a over a billion. 

(iii) Supernovae type la: the commonly accepted scenario for the formation of supernovae type la is 
a white dwarf accreting material from a binary companion. The core heats up as the gravitational 
pressure increases, eventually leading to carbon fusion ignition, followed by oxygen burning. This 
runaway reaction releases within seconds a huge amount of energy, resulting in a violent explosion which 
is accompanied by a massive surge in luminosity. Observationally, type la supernovae are characterized 
by the absence of hydrogen lines in their spectrum and they are considered almost standard candles, in 
the sense that there is a strong empirical correlation between their duration and their peak luminosity. 
From measurements of their brightness as a function of time (the light curve) , their intrinsic luminosity 
can be reconstructed. The data are then used to reconstruct the redshift-distance relationship, i.e. the 
Hubble diagram, which in turn depends on the cosmological parameters [106-109]. Current data extend 
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Figure 5. Current matter power spectrum measurements from the Sloan Digital Sky Survey and best-fit ACDM power spectrum 
(solid/dashed lines, without/with non-linear corrections), corresponding to the cosmological parameters extracted from WMAP 3-yr 
CMB data (the top and bottom curves are for two different samples). This shows that even without fitting to matter power spectrum 
data, the best-fit CMB model is in good agreement with the galaxy distribution data (from [105]). 

out to about z ~ 1.4 and encompass a few hundreds of supernovae (a number which will increase by a 
factor of 10 thanks to planned searches). Supernovae data were the first line of evidence in 1998 [110,111] 
that the expansion of the Universe is accelerating - an effect attributed to the existence of dark energy 
(see e.g. [112]). 

(iv) Weak gravitational lensing: the presence of inhomogeneities in the distribution of matter along 
the line of sight distorts the shape of background galaxies due to the deflection of light rays. This is 
most spectacularly displayed in observations of strong lensing, showing the characteristic arc-shaped 
multiple images of background galaxies. However, the same physics affects to a much smaller degree 
the shape of any background galaxy, distorting it by about 0.1 to 1%. This is called weak gravitational 
lensing. Although it is impossible to measure such small distortions for any single galaxy, the effect can 
still be detected statistically, by correlating the shear pattern (distortion due to gravitational lensing) 
of several thousand galaxies. The resulting weak lensing power spectrum probes a combination of the 
geometrical setup (distance to the background sources and to the lens) and of the parameters controlling 
the growth of structures, in particular the amount of matter (both visible and dark) and the strength 
of the perturbations (for a recent review, see [113] Some recent observations are reported in [114-117]). 
By dividing the source galaxies into slices of different redshift, it is possible to carry out a sort of 
"cosmic tomography", reconstructing the dark matter distribution between us and the sources [118]. 
Although it has not yet reached the same level of precision of the CMB, weak lensing shows great 
promise for the future in constraining cosmological parameters and in particular dark energy. 

5.3 Constraining cosmological parameters 

As outlined is section [3TT1 our inference problem is fully specified once we give the model (which parameters 
are allowed to vary and their prior distribution) and the likelihood function for the data sets under 
consideration. For the cosmological observations described above, relevant cosmological parameters can be 
broadly classified in four categories. 

(i) Parameters describing the dynamics of the background evolution: they represent the matter-energy 
content of the Universe and its expansion history, relating redshift with comoving distance, see Eq. (|44l) . 
The Hubble constant today is written as Hq = WOh km/s/Mpc, and is used to define the critical 
energy density, i.e. the energy density needed, to make the Universe spatially flat: /?crit — 

1.88xl0" 29 /i 2 
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g/cm 3 . The remaining density parameters (Cl x in Eq. (|44jl ) are then written in units of the critical 
energy density, so that for example the energy density in baryons is given by pb = O&Pcrit- Standard 
parameters include the energy density in photons (f2 7 ), neutrinos baryons cold dark matter 

(^cdm); cosmological constant (Oa) or, more generally, a possibly time-dependent dark energy (fide)- 
(ii) Parameters describing the initial conditions for the fluctuations: they give the type of initial conditions, 
adiabatic (where the spatial distribution of fluctuations is the same for all fluids emerging from inflation, 
up to multiplicative factors) or isocurvature (where there is a mismatch between perturbations among 
two components). The most general type of initial conditions is described by a correlation matrix that 
contains 10 free parameters representing the excitation amplitude of each mode [119-121]. The simplest 
parameterization of the distribution of perturbations with scale is then given for each mode in terms 
of a spectral index. 

(hi) Nuisance parameters: these often relate to insufficiently constrained aspects of the physics of the ob- 
served objects, or to uncertainties in the measuring process. We are not interested in determining them, 
but accounting for their uncertainty is important in order to obtain a correct estimate of the error on 
the physical parameters we are seeking to determine. If the observable quantity has a strong depen- 
dence on poorly determined nuisance parameters, then simply fixing the nuisance parameters instead 
of marginalizing over them will lead to serious underestimation of the uncertainty for the remaining 
parameters. Example of nuisance parameters are the bias factor in galaxy surveys, residual beam cal- 
ibration uncertainty for CMB data, supernovae intrinsic evolution parameters, intrinsic ellipticity of 
background galaxies in weak lensing and others. 

(iv) Parameters describing new physics: this is where the exciting frontier of data analysis lies, and we are 
trying to constrain or detect effects arising from new physics in the model, such as time-variation of the 
fine structure constant, the presence of cosmic strings, the mass of neutrinos, non-trivial topology of the 
Universe, extra dimensions, time-variations of dark energy properties, and much more. Although often 
framed as a parameter inference problem, this is actually a model comparison question, and is therefore 
best tackled with the methods described in section [H Therefore, in this case the parameter inference 
step is only the first level towards working out the outcome of the higher level model comparison step. 

5.3.1 The joint likelihood function. When the observations are independent, the log-likelihoods for 
each observation simply adcQ. Defining x 2 = — 21n£, we have that 

Xtot = XCMB + XSN + Xlens + XhSS + ••• ( 45 ) 

One important advantage of combining different observations as in Eq. (I45p is that each observable has 
different degenerate directions, i.e. directions in parameter space that are poorly constrained by the data. 
By combining two or more types of observables, it is often the case that the two data sets together 
have a much stronger constraining power than each one of them separately, because they mutually break 
parameters degeneracies. Combination of data sets should never be carried out blindly, however. The 
danger is that the data sets might be mutually inconsistent, in which case combining them singles out 
in the posterior a region that is not favoured by any of the two data sets separately, which is obviously 
unsatisfactory. Such discrepancies might arise because of undetected systematics, or insufficient modelling 
of the observations. 

In order to account for possible discrepancies of this kind, Ref. [122] suggested to replace Eq. (135]) by 

Xtot = ° iX i ( 46 ) 
i 

where a, are (unknown) weight factors ( "hyperparameters" ) for the various data sets, which determine 
the relative importance of the observations. A non-informative prior is specified for the hyperparameters, 



1 This is of course not the case when one is carrying out correlation studies, where the aim is precisely to exploit correlation among 
different observables (for example, the late Integrated Sachs- Wolfe effect). 
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which are then integrated out in a Bayesian way, obtaining an effective chi-square 




(47) 



where Ni is the number of data points in data set %. This method has been applied to combine different 
CMB observations in the pre-WMAP era [122,123]. A technique based on the comparison of the Bayesian 
evidence for different data sets has been employed in [124], while Ref. [125] uses a technique similar in spirit 
to the hyperparameter approach outlined above to perform a binning of mutually inconsistent observations 
suffering from undetected systematics, as explained in [126]. 

After the likelihood has been specified, it remains to work out the posterior pdf, usually obtained 
numerically via MCMC technology, and report posterior constraints on the model parameters, e.g. by 
plotting 1 or 2-dimensional posterior contours. We now sketch the way this program has been carried out 
as far as cosmological parameter estimation is concerned. 

5.3.2 Likelihood based ■parameter determination. Up until around 2002, the method of choice for 
cosmological parameter estimation was either direct numerical maximum likelihood search [127, 128] or 
evaluation of the likelihood on a grid in parameter space. Once the likelihood has been mapped out, 
(frequentist) confidence intervals for the parameters are obtained by finding the maximum-likelihood 
point (or, equivalently, the minimum chi-square) and by delimiting the region of parameter space where 
the log-likelihood drops by a specified amount (details can be found in any standard statistics textbook). 
If the likelihood is a multi-normal Gaussian, then this procedure leads to the familiar "delta chi-square" 
rule-of-thumb, i.e. that e.g. a 95.4% (2a) confidence interval for 1 parameter is delimited by the region 
where the x 2 = — 21og£ increases by Ax 2 = 4.00 from its minimum value (see e.g. [43]). Of course the 
value of Ax 2 depends both on the number of parameters being constrained and on the desired confidence 
interval 

Approximate confidence intervals for each parameter were then usually obtained from the above proce- 
dure, after maximising the likelihood across the hidden dimensions rather than marginalising [130,131], 
since the latter procedure required a computationally expensive multi-dimensional integration. The ratio- 
nale was that maximisation is approximately equivalent to marginalisation for close-to-Gaussian problems 
(a simple proof can be found in Appendix A of [132]), although it was early recognized that this is not always 
a good approximation for real-life situations [133]. Marginalisation methods based on multi-dimensional 
interpolation were devised and applied in order to improve on this respect [132,134]. Many studies adopted 
this methodology, which could not quite be described as fully Bayesian yet since it was likelihood-based 
and the the notion of posterior was not explicitly introduced. Often, the choice of particular theoretical 
scenarios (for example, a flat Universe or adiabatic initial conditions) or the inclusion of external con- 
straints (such as bounds on the baryonic density coming from Big Bang nucleosyhntesis) were described as 
"priors" . A more rigorous terminology would call the former a model choice, while the latter amounts to 
inclusion (in the likelihood) of external information. Since the likelihood could be well approximated by a 
simple log-normal distribution, its computation cost was fairly low. With the advent of CMBFAST [135], the 
availability of a fast numerical code for the computation of CMB and matter power spectra meant that 
grids of up to 30 million points and parameter spaces of dimensionality up to order 10 could be handled 
in this way [134]. 



x An important technical point is that frequentist confidence intervals are considered random variables — they give the range within 
which our estimate of the parameter value will fall e.g. 95.4% of the time if we repeat our measurement N — > oo times. The true 
value of the parameter is given (although unknown to us) and has no probability statement attached to it. On the contrary, Bayesian 
credible intervals containing for example 95.4% of the posterior probability mass represent our degree of belief about the value of the 
parameter itself. Often, Bayesian credible intervals are imprecisely called "confidence intervals" (a term that should be reserved for the 
frequentist quantity), thus fostering confusion between the two. Perhaps this happens because for Gaussian cases the two results are 
formally identical, although their interpretation is profoundly different. This can have important consequences if the true value is near 
the boundary of the parameter space, in which case the results from the frequentist and Bayesian procedure may differ substantially - 
see [129] for an interesting example involving the determination of neutrino masses. 
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Table 3. State— of— the art cosmological parameter inference from WMAP 3— year CMB data [80] and Sloan Digital Sky Survey data [105]. Posterior 
median and 68% posterior region, obtained for flat priors on the parameter set in the top section, with the exception of the reionization optical depth 
r, for which a flat prior has been adopted on exp(— 2r) instead (adapted from [105]). 



Parameter Value 



Meaning 



Definition 



Matter budget parameters 



u.59is_ 0020 
n n999+ - 0007 

u - uzz -0.0007 

u.iuou_ 0040 
Initial conditions parameters 



0.69CT 



-,+0.045 
J -0.044 
953+ 016 



CMB acoustic angular scale fit (degrees) 

Baryon density 

Cold dark matter density 



Scalar fluctuation amplitude 
u , ! ! , , Scalar spectral index 

Reionization history (abrupt reionization) 



6s = r a (z ICC )/d A (z ICC ) X 180/7T 

ui b = n b h 2 « p b /(1.88 x 10- 26 kg/m 3 ) 

lo c = Q cdm h 2 w p c /(1.88 x 10- 26 kg/m 3 ) 



Primordial scalar power at k = 0.05/Mpc 
Primordial spectral index at k = 0.05/Mpc 



n n«7+ ' 028 

U.U8f_ 030 



Reionization optical depth 
Nuisance parameters (for galaxy power spectrum) 

1.896lo gg Galaxy bias factor 

30.3^4 * Nonlinear correction parameter 



('ill 



See [105] for details. 
See [105] for details. 



Derived parameters (functions of those above) 



^tot 

h 

n b 

n c 

n A 

C8 



1.00 (flat Universe assumed) 

u./ju_ 019 

041 fi+ 0019 
u.u4io_ 0018 

7+0.016 
-0.015 
-,+0.018 
-0.017 

761 + 017 
O. 'Dl_o .018 

U. 'OO_ .035 



0.197Z 
0.239 4 



Total density/critical density 

Hubble parameter 

Baryon density /critical density 

CDM density/critical density 

Matter density /critical density 

Cosmological constant density/critical density 

Density fluctuation amplitude 



^tot — + — 1 — 

h = \J (ui b + aj c )/(f2tot - ^a) 
f2{, = u>i,/h 2 

Sicdm = UJ c /h 2 
r2 m = £l b + ^cdm 

fl A a h-' 2 PA (1.88 x 10- 26 kg/m 3 ) 
See [105] for details. 



5.3.3 Bayes in the sky — The rise of MCMC. The watershed moment after which methods based 
on likelihood evaluation on a grid where definitely overcome by Bayesian MCMC methods can perhaps be 
indicated in Ref. [136], which marked one of the last major studies performed using essentially frequentist 
techniques. Pioneering works in using MCMC technology for cosmological parameter extraction include the 
application to VSA data [45,137], the use of simulated annealing [138] and the study of Ref. [139]. But it 
was the release of the CosmoMC cod^in 2002 [140] that made a huge impact on the cosmological community, 
as CosmoMC quickly became a standard and user-friendly tool for Bayesian parameter inference from CMB, 
large scale structure and other data. The favourable scalability of MCMC methods with dimensionality of 
the parameter space and the easiness of marginalization were immediately recognized as major advantages 
of the method. CosmoMC employs the CAMB code [141] to compute the matter and CMB power spectra from 
the physical model parameters. It then employs various MCMC algorithms to sample from the posterior 
distribution given current CMB, matter power spectrum (galaxy power spectrum, baryonic acoustic wiggles 
and Lyman-a! observations) and supernovae data. 

State-of-the-art applications of cosmological parameter inference can be found in papers such as [97, 
105, 142, 143]. Table [3] summarizes recent posterior credible intervals on the parameters of the vanilla 
ACDM model introduced above while Figure [6] shows the full 1— D posterior pdf for 6 relevant parameters 
(both from Ref. [105]). The initial conditions emerging from inflation are well described by one adiabatic 
degree of freedom and a distribution of fluctuations that deviates slightly from scale invariance, but which 
is otherwise fairly featureless. 

Addition of extra parameters to this basic description (for example, a curvature term, or a time-evolution 
of the cosmological constant, in which case it is generically called "dark energy") is best discussed in terms 
of model comparison rather than parameter inference (see next section). The breath and range of studies 
aiming at constraining extra parameters is such that it would be impossible to give even a rough sketch 
here. However we can say that the simple model described by the 6 cosmological parameters given in 
Table [3] appears at the moment appropriate and sufficient to explain most of the presently available data. 
MCMC is nowadays almost universally employed in one form or another in the cosmology community. 



1 Available from: cosmologist . inf o/cosmomc (accessed Jan 20th 2008). 
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Figure 6. Posterior constraints on key cosmological parameters from recent CMB and large scale structure data, compare Table [3] Top 
row, from left to right, posterior pdf (normalized to the peak) for the cosmological constant density in units of the critical density, the 
(physical) baryons and cold dark matter densities. Bottom row, from left to right: optical depth to reionization, scalar tilt and scalar 
fluctuations amplitude. Yellow using WMAP 1— yr data, orange WMAP 3-yr data and red adding Sloan Digital Sky Survey galaxy 
distribution data. Spatial flatness and adiabatic initial conditions have been assumed. This set of only 6 parameters (plus 2 other 
nuisance parameters not shown here) appear currently sufficient to describe most cosmological observations (adapted from [105]). 

5.3.4 Recent developments in parameter inference. Nowadays, the likelihood evaluation step is be- 
coming the bottleneck of cosmological parameter inference, as the WMAP likelihood code [80] requires 
the evaluation of fairly complex correlation terms, while the computational time for the actual model 
prediction in terms of power spectra is subdominant (except for spatially curved models or non-standard 
scenarios containing active seeds, such as cosmic strings). This trend is likely to become stronger with fu- 
ture CMB datasets, such as Planck. Currently, for relatively straightforward extensions of the concordance 
model presented in Table [3l a Markov chain with enough samples to give good inference can be obtained 
in a couple of days of computing time on a "off-the-shelf" machine with 4 CPUs. 

Massive savings in computational effort can be obtained by employing neural networks techniques, a 
computational methodology consistent of a network of processors called "neurons" . Neural networks learn 
in an unsupervised fashion the structure of the computation to be performed (for example, likelihood eval- 
uation across the cosmological parameter space) from a training set provided by the user. Once trained, the 
network becomes a fast and efficient interpolation algorithm for new points in the parameter space, or for 
classification purposes (for example to determine the redshift of galaxies from photometric data [144,145]). 
When applied to the problem of cosmological parameter inference, neural networks can teach themselves 
the structure of the parameter space (for models up to about 10 dimensions) by employing as little as a 
few thousands training points, for which the likelihood has to be computed numerically as usual. Once 
trained, the network can then interpolate extremely fast between samples to deliver a complete Markov 
chain within a few minutes. The speed-up can thus reach a factor of 30 or more [146, 147]. Another promis- 
ing tool is a machine-learning based algorithm called PICO [148, 1490, requiring a training set of order 
~ 10 4 samples, which are then used to perform an interpolation of the likelihood across parameter space. 
This procedure can achieve a speed-up of over a factor of 1000 with respect to conventional MCMC. 

The forefront of Bayesian parameter extraction is quickly moving on to tackle even more ambitious 
problems, involving thousands of parameters. This is the case for the Gibbs sampling technique to extract 
the full posterior probability distribution for the power spectrum Cgs directly from CMB maps, performing 
component separations (i.e., multi-frequency foregrounds removal) at the same time and fully propagating 
uncertainties to the level of cosmological parameters [150-153]. This method has been tested up to i < 50 
on WMAP temperature data and is expected to perform well up to i < 100 — 200 for Planck-quality data. 



1 Available from: http://cosmos.astro.uiuc.edu/pico/ (accessed Jan 14th 2008). 
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Equally impressive is the Hamiltonian sampling approach [154], which returns the C/s posterior pdf from 
a (previously foreground subtracted) CMB map. At WMAP resolution, this involves working with ~ 10 5 
parameters, but the efficiency is such that the 800 Ci distributions (for the temperature signal) can be 
obtained in about a day of computation on a high-end desktop computer. 

Another frontier of Bayesian methods is represented by high energy particle physics, which for historical 
and methodological reasons has been so far mostly dominated by frequentist techniques. However, the 
Bayesian approach to parameter inference for supersymmetric theories is rapidly gathering momentum, 
due to its superior handling of nuisance parameters, marginalization and inclusion of all sources of un- 
certainty. The use of Bayesian MCMC for supersymmetry parameter extraction has been first advocated 
in [155], and has then been rigorously applied to a detailed analysis of the Constrained Minimal Supersym- 
metric Standard Model [156-161], a problem that involves order 10 free parameters. A public code called 
SuperBayeS is available to perform an MCMC Bayesian analysis of supersymmetric mo delfl Presently, 
there are hints that the constraining power of the data is insufficient to override the prior choice in this 
context [162], but future observations, most notably by the Tevatron or LHC [163,164] and tighter limits 
(or a detection) on the neutralino scattering cross section [159], should considerably improve the situation 
in this respect. 

5.4 Caveats and common pitfalls 

Although Bayesian inference is quickly maturing to become an almost automated procedure, we should 
not forget that a "black box" approach to the problem always hides dangers and pitfalls. Every real world 
problem presents its own peculiarities that demand careful consideration and statistical inference remains 
very much a craft as much as a science. While Bayes' Theorem is never wrong, incorrect specification of the 
prior (for example, making unwarranted assumptions of failing to specify relevant external information) 
or inappropriate construction of the likelihood (e.g., not reflecting the experiment or neglecting relevant 
sources of uncertainty) will easily lead to wrong inferences. Below we list a few common pitfalls that need 
to be considered in the cosmological context. 

(i) Hidden prior information. Sometimes the choice of flat priors on the parameters is uncritically taken 
to be uninformative. This is often not the case. For instance, a flat prior is indeed non-informative if 
we are estimating the mean of a Gaussian, but if we are interested in its standard deviation a a prior 
which is flat in ln<7 ("Jeffreys' prior") is instead the appropriate choice (see e.g. [6]). When considering 
extensions of the ACDM model, for example, it is common practice to parameterize the new physics 
with a set of quantities about which very little (if anything at all) is known a priori. Assuming flat 
priors on those parameters is often an unwarranted choice. Flat priors in "fundamental" parameters will 
in general not correspond to a flat distribution for the observable quantities that are derived from those 
parameters, nor for other quantities we might be interested in constraining. Hence the apparently non- 
informative choice for the fundamental parameters is actually highly informative for other quantities 
that appear directly in the likelihood. It is extremely important that this "hidden prior information" 
be brought to light, or one could mistake the effect of the prior for constraining power of the data. An 
effective way to do that is to plot pdf's for the quantities of interest without including the data, i.e. 
run an MCMC sampling from the prior only (see [45] for an instructive example). 

(ii) No well— defined prior measure. Another difficulty is that in many cases several physically equally 
plausible parameterizations exist, in particular for problems involving unknown amplitudes, such as for 
example isocurvature modes in the initial conditions. Since a flat prior on parameterization A is not flat 
in parameterization B if the two are related by a non-linear transformation, see Eq. (jlip . two physically 
equivalent setups might lead to widely different inferences. In the absence of theoretical or physical 
reasons to prefer parameterization A or B this leads to the unsatisfactory dependence of the posterior on 
the volume enclosed by the prior, as discussed for the problem of general initial conditions in [120]. For 
an example of such a "prior volume effect" applied to inflationary models parameters, see [165]. Other 



1 Code available from: superbayes.org (accessed Feb 13th 2008). 
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obvious domains where this effect might be problematic are dark energy equation of state reconstruction 
(e.g., [166, 167]), initial power spectrum reconstruction [168-170] and determination of inflationary 
potential parameters [171-173]. In some special cases, fundamental principles can be invoked to define 
the appropriate prior measure, which exploits either symmetries or invariance properties of the problem. 
An important example is image reconstruction, where the Maximum Entropy principle is employed to 
define an informative prior on the image space (see e.g. [174-176] for an astronomical application). 

(iii) Lindley's paradox. A methodological issue is the widespread use of inappropriate tools to answer 
what is actually a model selection question. Often we want to assess the "significance" of an effect, when 
a deviation is observed in the posterior from the parameter value that would correspond to the absence 
of that effect. This situation is extremely common across very different domains, from estimation of 
photon number counts in presence of a background (see [18,19,177] for a proper Bayesian treatment), 
to the assessment of the anomalies in the large-scale CMB power spectrum ( [178] compare frequentist 
methods with the Bayesian evidence), the detection of gravitational waves ( [179] introduce a Bayesian 
technique that is automatically optimal) or the discovery of extra-solar planets [180]. In general, the 
"number of sigma" away from the expected value in the absence of a signal is not a good indicator 
of the significance of the effect, a result that goes under the name of "Lindley's paradox" [181]. For 
further details, see Appendix A in [58]. 

(iv) Using the data twice. Often the data are used to guide in a qualitative fashion the model building 
or the choice of priors. If the same data are then used for a quantitative comparison the significance 
of the effect can be drastically overestimated. This is a well-known problem in frequentist statistics, 
which therefore insists that one should design one's statistical tests before seeing the data at all. In 
cosmology this might often be problematic or impossible to achieve, as new observations will uncover 
completely unexpected phenomena (for example, the large-scale anomalies in the cosmic microwave 
background). It is however important to keep in mind that the significance of such effects is difficult 
to assess. 



6 Cosmological Bayesian model building 

The Bayesian model comparison approach based on the evaluation of the evidence is being increasingly 
applied to model building questions such as: are isocurvature contributions to the initial conditions required 
by the data [46,58,60,182]? Is the Universe flat [58,70,183]? What is the best description of the primordial 
power spectrum for density perturbations [47,51,58,70,165,184,185]? Is dark energy best described as a 
cosmological constant [50,51, 186-191]? In this section we review the status of the field. 

6.1 Evidence for the cosmological concordance model 

Table |4] is a fairly extensive compilation of recent results regarding possible extensions to (or reduction 
of) the vanilla ACDM concordance cosmological model introduced in section 15.11 We have chosen to 
compile only results obtained using the full Bayesian evidence, rather than approximate model comparisons 
obtained via the information criteria because the latter are often not adequate approximations, for the 
reasons explained in section 14.71 Of course, the outcome depends on the Occam's razor effect brought 
about by the prior volume (and sometimes, by the choice of parameterization). Where applicable, we have 
show the sensitivity of the result on the prior assumptions by giving a ballpark range of values for the 
Bayes factor, as presented in the original studies. The reader ought to refer to the original works for the 
precise prior and parameter choices and for the justification of the assumed prior ranges. 

As anticipated, the 6 parameters ACDM concordance model is currently well supported by the data, 
as the inclusion of extra parameters is not required by the Bayesian evidence. This is shown by the 
fact that most model comparisons return either an undecided result or they support the ACDM model 
(negative values for In B in Table H|) . The only exception is the support for a cut-off on large scales in the 
power spectrum reported by [185]. This is clearly driven by the anomalies in the large scale CMB power 
spectrum, which in this case are interpreted as being a reflection of a lack of power in the primordial power 
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Table 4. Summary of model comparison results against the ACDM concordance model (sec Table [3j using Baycsian model comparison for nested 
models. A negative (positive) value for In B indicates that the competing model is disfavoured (supported) with respect to the ACDM model. The 
column A7V par gives the difference in the number of free parameters with respect to the ACDM concordance model. A negative value means that one 
of the parameters has been fixed. Sec references for full details and in particular for the choice of priors on the model parameters, which control the 
strengths of the Occam's razor effect. 



Competing model 


AiVpar 


ln£ 


Ref 


Data 


Outcome 


Initial conditions 












Isocurvature modes 












CDM isocurvature 


+ 1 


-7.6 


[58] 


WMAP3+, LSS 


Strong evidence for adiabaticity 


+ arbitrary correlations 


+4 


-f.O 


[46] 


WMAP1+, LSS, SN la 


Undecided 


Neutrino entropy 


+1 


[-2.5, -6.5] p 


[60] 


WMAP3+, LSS 


Moderate to strong evidence for adiabaticity 


+ arbitrary correlations 


+4 


-1.0 


[46] 


WMAP1+, LSS, SN la 


Undecided 


Neutrino velocity 


+ 1 


[-2.5, -6.5] p 


[60] 


WMAP3+, LSS 


Moderate to strong evidence for adiabaticity 


+ arbitrary correlations 


+4 


— 1.0 


[16] 


WMAP1+, LSS, SN la 


Undecided 


Primordial power spect: 


rum 










No tilt (n s = 1) 


-1 


+0.4 


[17] 

L J 


WMAP1+, LSS 


Undecided 






[-1.1, -0.6l p 

L ' J 


[51] 


WMAP1+, LSS 


Undecided 






-0.7 


[58] 


WMAP1+, LSS 


Undecided 






-0.9 


[70] 


WMAP1+ 


Undecided 






[— U.7, — 1.7J^' 


[185] 


WMA.ro+ 


n 8 = 1 weakly disfavoured 






—2.0 


[184] 


WMAP3+, LSS 


n s = 1 weakly disfavoured 






—2.6 


I' "J 


WMAP3+ 


n s = 1 moderately disfavoured 






—2.9 


[581 


WMAP34- LSS 


Tig ~~ 1 moderately disfavoured 






< -3.9 C 


[65] 


WMAP3+, LSS 


Moderate evidence at best against n s ^ 1 


Running 


+1 


[-0.6, 1.0]P' d 


[185] 


WMAP3+, LSS 


No evidence for running 






< 0.2 C 


[165] 


WMAP3+, LSS 


Running not required 


Running of running 


+2 


< 0.4 C 


[165] 


WMAP3+, LSS 


Not required 


Large scales cut— off 


+2 


[1.3,2.2]P> d 


[185] 


WMAP3+, LSS 


Weak support for a cut— off 


Matter— energy content 












Non— flat Universe 


+1 


-3.8 


[70] 


WMAP3+, HST 


Flat Universe moderately favoured 






-3.4 


[58] 


WMAP3+, LSS, HST 


Flat Universe moderately favoured 


Coupled neutrinos 


+ 1 


-0.7 


[192] 


WMAP3+, LSS 


No evidence for non— SM neutrinos 


Dark energy sector 












w(z) = w eS ^ -1 


+1 


[-1.3, -2.7] p 


[186] 


SN la 


Weak to moderate support for A 






—3.0 


[50] 


SN la 


Moderate support tor A 






— 1.1 


r i — 1 1 
[51] 


WMArl+, LSS, la 


Weak support for A 






[-0.2, -IIP 


[187] 

L 1 J 


SN la, BAO, WMAP3 


Undecided 






[-1.6, -2.3] d 


[188] 


SN la, GRB 


Weak support for A 


w(z) = W() + W\Z 


+2 


[-1.5, -3.4]p 


[186] 


SN la 


Weak to moderate support for A 






-6.0 


[50] 


SN la 


Strong support for A 






-1.8 


[187] 


SN la, BAO, WMAP3 


Weak support for A 


w(z) = wo + w a (l — a) 


+2 


-1.1 


[187] 


SN la, BAO, WMAP3 


Weak support for A 






[-1.2, -2.6] d 


[188] 


SN la, GRB 


Weak to moderate support for A 


Reionization history 












No reionization (t = 0) 


-1 


-2.6 


[70] 


WMAP3+, HST 


r ^ moderately favoured 


No reionization and no tilt 


-2 


-10.3 


[70] 


WMAP3+, HST 


Strongly disfavoured 



d Depending on the choice of datasets. 



p Depending on the choice of priors. 

c Upper bound using Bayesian calibrated p— values, see section [4.51 

Data sets: WMAP1+ (WMAP3+): WMAP 1st year (3-yr) data and other CMB measurements. LSS: Large scale structures data. SN la: 
supernovae type la. BAO: baryonic acoustic oscillations. GRB: gamma ray bursts. 



spectrum. Whether such anomalies are of cosmological origin remains however an open question [193,194]. 
If extensions of the model are not supported, reduction of ACDM to simpler models is not viable, either: 
recent studies employing WMAP 3-yr data find that a scale invariant spectrum with no spectral tilt is 
now weakly to moderately disfavoured [58,65,70,184]. Also, a Universe with no reionization is no longer 
a good description of CMB data, and a non-zero optical depth r is indeed required [70] . 
A few further comments about the results reported in Table H] are in place: 

(i) Regarding the type of initial conditions for cosmological perturbations, all parameter extraction studies 
to date (with the exception of [195]) find that a purely adiabatic mode is in agreement with observations, 
and constrain the isocurvature fraction to be below about 10% for one single isocurvature mode at 
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the time [60] and below about 50% for a general mixture of modes [196,197]. Prom a model selection 
perspective, this means that we expect the purely adiabatic model to be preferred over a more complex 
model with a mixture of isocurvature modes. This is indeed the case, but the result is strongly dependent 
on the parameterization adopted for the isocurvature sector, which determines the strength of the 
Occam's razor effect. This is a consequence of the difficulty of coming up with a well motivated 
phenomenological parameterization of the isocurvature amplitudes, see the discussion in [58,182]. 

(ii) The shape of the primordial power spectrum has attracted considerable attention from a model compar- 
ison perspective [47,51,58,65,70,165,184,185]. With the exception of the large scale cut-off mentioned 
above, the current consensus appears to be that a power-law distribution of fluctuations, with power 
spectrum P(k) = Po(k/ko) nB ~ 1 with n s < 1 is currently the best description. This is usually interpreted 
as evidence for inflation. However, a proper model comparison of inflationary predictions involves in- 
cluding the presence of tensor modes generated by gravitational waves, parameterized in terms of their 
amplitude parameter r. Including this extra parameter runs into the difficulty of specifying its prior 
volume, as the two obvious choices of priors flat in r or logr lead to very different model comparison 
results [184]. Another problem is that the comparison might be ill-defined, as the simpler model with 
n s = 1 and r = is presumably some sort of alternative, unspecified model without inflation that 
would not solve the horizon problem. On this ground alone, unless an alternative solution to the hori- 
zon problem is put on the table, such an alternative model would be immediately thrown out (see the 
discussion in [165]). Finally, higher-order terms in the Taylor expansion of the power spectrum, such 
as a running of the spectral index or a running of the running, are currently not required. This appears 
a robust result with respect to a wide choice of priors and data sets. 

(iii) Present-day constraints on the curvature of spatial sections are of order |f2 K | < 0.01 (with £l K = 
corresponding to a flat, Puclidean geometry), stemming from a combination of CMB, large scale 
structures and supernovae data. Choosing a phenomenological prior of width Af2 K = 1 around 
delivers a moderate support for a flat Universe versus curved models [58,70]. However, adopting an 
inflation-motivated prior instead, A£2 K ~ 10 -5 , would lead to an undecided result (InB = 0) for the 
model comparison, as the data are not strong enough to discriminate between the two models in this 
case. This can be formalized by considering the Bayesian model complexity for the two choices of priors, 
Pq. (|35p . Noticing that cr/X is the ratio between the likelihood and prior widths, for a prior on the 
curvature parameter of width 1, cj/S ~ 10~ 2 and C ~ 1, hence the parameter has been measured. But 
if we take a prior width ~ 10~ 5 , cr/S ~ 10 3 hence C b -> 0. In the latter case, we can see from Pq. (|2ip 
that the Bayes factor between the two models -Boi — ► 1 and the evidence is inconclusive, awaiting 
better data. 

(iv) Model comparisons regarding the dark energy sector suffer from considerably uncertainty. Clearly, the 
model to beat is the cosmological constant (with equation of state parameter w = —1 at all redshifts), 
but alternative dark energy scenarios suffer from the fundamental difficulty of motivating physically 
both the parameterization of the dark energy time dependence and the prior volume for the extra 
parameters [198] (see [199] for a review of models and [200] for on overview of recent constraints). 
However, the semi-phenomenological studies shown in Table H] do agree in deeming a cosmological 
constant a sufficient description of the data. This is again a consequence of the fact that no time 
evolution of the equation of state is detected in the data, hence the strength of the support in favour of 
the cosmological constant becomes a function of the available parameter space under the more complex, 
alternative models. Given this result, it is interesting to ask what level of accuracy is required before 
our degree of belief in the cosmological constant is overwhelmingly larger than for an evolving dark 
energy, assuming of course that future data will not detect any significant departure from w = — 1. To 
this end, a simple classification of models has been given in [201] in terms of their effective equation of 
space parameter, w c s, representing the time-varying equation of state averaged over redshift with the 
appropriate weighting factor for the observable [202]. The three categories considered are "phantom 
models" (exhibiting large, negative values for the equation of state, —11 < u> e ff < —1), "fluid-like dark 
energy" (—1 < w c s < —1/3) and "small-departures from A" models (—1.01 < w e s < 0.99). Assuming 
a flat prior on these ranges of values for w e s, consideration of the Bayes factor between each of those 
models and the cosmological constant shows that gathering strong evidence against each of the models 
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Tabic 5. Summary of model comparison results against the ACDM concordance model for some alternative (i.e., non— nested) cosmological 
models. A negative (positive) value for In B indicates that the competing model is disfavoured (supported) with respect to ACDM. The 
column 7V par gives the number of free parameters in the alternative model. See references for full details about the models, priors and data 
used. 



Competing model N p!lT 


InB 


Ref 


Data 


Outcome 


Alternatives to FRW 

Bianchi VII h 5 to 8 
5 to 6 


[-0.9, 1.2} d 'P 
[-0.1,-1.2]*' 


[54] 
[203] 


WMAP1, WMAP3 
WMAP3 


Weak support (at best) for Bianchi template 
No evidence after texture correction 


LTB models 4 


-3.6 


[204] 


WMAP3, BAO, SN la 


Moderate evidence against LTB 


Fractal bubble model 2 


0.3 


[89] 


SN la 


Undecided 


Asymmetry in the CMB 

Anomalous dipole 3 


1.8 

< 2.2 C 


[205] 
[65] 


WMAP3 
WMAP3 


Weak evidence for anomalous dipole 
Weak evidence at best 



d Depending on the choice of datasets. 
p Depending on the choice of priors. 

c Upper bound using Bayesian calibrated p-values, see section \A. 51 



requires an accuracy on w e s of order o~ c s = 0.05 for phantom models (which are therefore already under 
pressure from current data, which have an accuracy of order ~ 0.1), <7 c ff = 3 x 10 -3 for fluid-like models 
(about a factor of 5 better than optimistic constraints from future observations) and cr e ff = 5 x 10~ 5 
for small-departure models. Refinements of this approach that employ more fundamentally-motivated 
priors could lead to an analysis of the expected costs/benefits from future dark energy observations in 
terms of their likely model selection outcome (we return on this issue in section [6^2]) . 

Let us now turn to models that are not nested within ACDM — i.e., alternative theoretical scenarios. 
Table [5] gives some examples of the outcome of the Bayesian model comparison with the concordance 
model. As above, we restrict our considerations to studies employing the full Bayesian evidence (there 
are many other examples in the literature carrying out approximate model comparison using information 
criteria instead). The model comparison is often more difficult for non-nested models, as priors must be 
specified for all of the parameters in the alternative model (and in the ACDM model, as well), in order 
to compute the evidence ratio. The usual caveats on prior choice apply in this case. From Table [5] it 
appears that the data do not seem to require fundamental changes in our underlying theoretical model, 
either in the form of Bianchi templates representing a violation of cosmic isotropy (see also [206]), or as 
Lemaitre-Tolman-Bondi models or fractal bubble scenarios with dressed cosmological parameters. The 
anomalous dipole in the CMB temperature maps is a fine example of Lindley's paradox. When fitting a 
dipolar template to the CMB maps, the effective chi-square improves by 9 to 11 units (depending on the 
details of the analysis) for 3 extra parameters [205,207], which would be deemed a "significant" effect 
using a standard goodness-of-fit test. However, the Bayesian evidence analysis shows that the odds in 
favour of an anomalous dipole are 9 to 1 at best (corresponding to InB < 2.2), which does not reach 
the "moderate evidence at best" threshold. Hence Bayesian model comparison is conservative, requiring a 
stronger evidence before deeming an effect to be favoured. 

6.2 Other uses of the Bayesian evidence 

Beside cosmological model building, the Bayesian evidence can be employed in many other different ways. 
Here we presents two aspects that are relevant to our topic, namely the applications to the field of multi- 
model inference and model selection forecasting. 

(i) Multi model inference. Once we realize that there are several possible models for our data, it 
becomes interesting to present parameter inferences that take into account the model uncertainty 
associated with this plurality of possibilities. In other words, instead of just constraining parameters 
within each model, we can take a step further and produce parameter inferences that are averaged 
over the models being considered. Let us suppose that we have a minimal model (in our case, ACDM) 
and a series of augmented models with extra parameters. A typical example from cosmology is dark 
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energy (first discussed in the context of multi-model averaging in [208]), where the minimal model 
has w = —1 fixed and there are several other candidate models with a time- varying equation of 
state, parameterized in terms of a number of free parameters and their priors. Let us denote by 6 the 
cosmological parameters common to all models. For the extended models, the redshift-dependence of 
the dark energy equation of state is described by a vector of parameters Ui (under model Mi). The 
ACDM model has no free parameters for the equation of state, hence the prior on wacdm is a delta 
function centered on w(z) = wq = —1. Then a straightforward application of Bayes' theorem leads to 
the following posterior distribution for the parameters: 



where p(9, u\d, Mi) is the posterior within each model Mi, and it is understood that the posterior 
has non-zero support only along the parameter directions Cu that are relevant for the model, and 
delta-functions along all other directions. Each term is weighted by the corresponding posterior model 
probability, 



The prior model probabilities p(Mi) are usually set equal, but a model preference can be incorporated 
here if necessary. The model averaged posterior distribution of Eq. (|48[) then represents the parameter 
constraints obtained independently of the model choice, which has been marginalized over. Unless one 
of the models is overwhelmingly more probable than the others (in which case the model averaging 
essentially disappears, as all of the weights for the other models go to zero), the model-averaged 
posterior distribution can be significantly different from the model-specific distribution. A counter- 
intuitive consequence is that in the case of dark energy, the model-averaged posterior shows tighter 
constraints around w = — 1 than any of the evolving dark energy models by itself. This comes about 
because ACDM is the preferred model and hence much of the weight in the model-averaged posterior 
is shifted to the point w = —1 [208]. For further details on multi-model inference, see e.g. [209,210]. 
Model selection forecasting. When considering the capabilities of future experiments, it is com- 
mon stance to predict their performance in terms of constraints on relevant parameters, assuming a 
fiducial point in parameter space as the true model (often, the current best-fit model). While this is 
a useful indicator for parameter inference tasks, many questions in cosmology fall rather in the model 
comparison category. A notable example is again dark energy, where the science driver for many future 
multi-million-dollar probes is to detect possible departures from a cosmological constant, hence to 
gather evidence in favour of an evolving dark energy model. It is therefore preferable to assess the 
capabilities of future experiments by their ability to answer model selection questions. 

The procedure is as follows (see [211] for details and the application to dark energy scenarios). At 
every point in parameter space, mock data from the future observation are generated and the Bayes 
factor between the competing models is computed, for example between an evolving dark energy and a 
cosmological constant. Then one delimits in parameter space the region where the future data would not 
be able to deliver a clear model comparison verdict, for example | In B\ < 5 (evidence falling short of the 
"strong" threshold). The experiment with the smallest "model-confusion" volume in parameter space 
is to be preferred, since it achieves the highest discriminative power between models. An application 
of a related technique to the spectral index from the Planck satellite is presented in [212,213]. 

Alternatively, we can investigate the full probability distribution for the Bayes factor from a future 
observation. This allows to make probabilistic statements regarding the outcome of a future model 
comparison, and in particular to quantify the probability that a new observation will be able to achieve 
a certain level of evidence for one of the models, given current knowledge. This technique is based on 
the predictive distribution for a future observation, which gives the expected posterior on 6 for an 
observation with experimental capabilities described by e (this might describe sky coverage, noise 




(48) 




(49) 



levels, target redshift, etc): 
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p(6\e,d) = J>(M|cO / dil>tpWhe,Mi)pWt\d,Mi). (50) 

Here, d are the currently available observations, p(Mi\d) is the current model posterior, p(8\ijj*, e, Mi) is 
the posterior on from a future observation e computed assuming ip* are the correct model parameters, 
while each term is weighted by the present probability that is the true value of the parameters, 
p(ip*\d, Mi). The sum over i ensures that the prediction averages over models, as well. From Eq. (|50p 
we can compute the corresponding probability distribution for In I? from experiment e, for example 
by employing MCMC techniques (further details are given in [59]). This method is called PPOD, 
for predictive posterior odds distribution and can be useful in the context of experiment design and 
optimization, when the aim is to determine which choice of e will lead to the best scientific return from 
the experiment, in this case in terms of model selection capabilities (see [214-216] for a discussion of 
performance optimization for parameter constraints). For further details on Bayes factor forecasts and 
experiment design, see [217]. 



7 Conclusions 

Bayesian probability theory offers a consistent framework to deal with uncertainty in several different 
situations, from parameter inference to model comparison, from prediction to optimization. The notion of 
probability as a degree of belief is far more general than the restricted view of probability as frequency, and 
it can be applied equally well both to repeatable experiments and to one-off situations. We have seen how 
Bayes' theorem is a unique prescription to update our state of knowledge in the light of the available data, 
and how the basic laws of probability can be used to incorporate all sorts of uncertainty in our inferences, 
including noise (measurement uncertainty), systematic errors (hyper-parameters), imperfect knowledge of 
the system (nuisance parameters) and modelling uncertainty (model comparison and model averaging). 
The same laws can equally well be applied to the problem of prediction, and there is considerable potential 
for a systematic exploration of experiment optimization and Bayesian decision theory (e.g., given what we 
know about the Universe and our theoretical models, what are the best observations to achieve a certain 
scientific goal?). 

The exploration of the full potential of Bayesian methods is only just beginning. Thanks to the increasing 
availability of cheap computational power, it now becomes possible to handle problems that were of 
intractable complexity until a few years ago. Markov Chain Monte Carlo techniques are nowadays a 
standard inference tool to derive parameter constraints, and many algorithms are available to explore 
the posterior pdf in a variety of settings. We have highlighted how the issue of priors — which has 
traditionally been held against Bayesian methods — is a false problem stemming from a misunderstanding 
of what Bayes' theorem says. This is not to deny that it can be difficult in practice to choose a prior that 
is a fair representation of one's degree of belief. But we should not shy away from this task — the fact 
is, there is no inference without assumptions and a correct application of Bayes' theorem forces us to be 
absolutely clear about which assumptions we are making. It remains important to quantify as much as 
possible the extent by which our priors are influencing our results, since in many cases when working at 
the cutting-edge of research we might not have the luxury of being in a data-dominated regime. 

The model comparison approach can formalize in a quantitative manner the intuitive assessment of 
scientific theories, based on the Occam's razor notion that simpler models ought to be preferred if they 
offer a satisfactory explanation for the observations. The Bayesian evidence and complexity tell us which 
models are supported by the data, and what is their effective number of parameters. Multi-model inference 
delivers model-averaged parameter constraints, thus merging the two levels of parameter inference and 
model comparison. 

The application of Bayesian tools to cosmology and astrophysics is blossoming. As both data sets and 
models become more complex, our inference tools must acquire a corresponding level of sophistication, as 
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basic statistical analyses that served us well in the past are no longer up to the task. There is little doubt 
that the field of cosmostatistics will grow in importance in the future, and Bayesian methods will have a 
great role to play. 
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